To Andy, Ella, and Joo


Preface 


It was a very different, unprecedented, and weird start of the semester, and I did 
not know what to do. This semester, I was supposed to offer a new senior-level 
undergraduate class on Advanced Intelligence to jointly teach students at the Department
of Bio/Brain Engineering and the Department of Mathematical Sciences. I
had initially planned a standard method for teaching machine learning, the contents 
of which are practical, experience-based lectures with a lot of interaction with the 
students through many mini-projects and term projects. Unfortunately, the global 
pandemic of COVID-19 has completely changed the world and such interactive 
classes are no longer an option most of the time. 

So, I thought about the best way to give online lectures to my students. I 
wanted my class to be different from other popular online machine learning courses 
but still provide up-to-date information about modern deep learning. However, 
not many options were available. Most existing textbooks are already outdated or 
very implementation oriented without touching the basics. One option would be to 
prepare presentation slides by adding all the up-to-date knowledge that I wanted to 
teach. However, for undergraduate-level courses, the presentation files are usually 
not enough for students to follow the class, and we need a textbook that students 
can read independently to understand the class. For this reason, I decided to write 
a reading material first and then create presentation files based on it, so that the 
students can learn independently before and after the online lectures. This was the 
start of my semester-long book project on Geometry of Deep Learning. 

In fact, it has been my firm belief that a deep neural network is not a magic black 
box, but rather a source of endless inspiration for new mathematical discoveries. 
Also, I believed in the famous quote by Isaac Newton, “Standing on the shoulders
of giants,” and was looking for a mathematical interpretation of deep learning. For me
as a medical imaging researcher, this topic was critical not only from a theoretical 
point of view but also for clinical decision-making, because we do not want to create 
false features that can be recognized as diseases. 

In 2017, on a street in Lisbon, I had a Eureka! moment in understanding the hidden
framelet structure in encoder-decoder neural networks. The resulting interpretation
of the deep convolutional framelets, published in the SIAM Journal on Imaging
Sciences, has had a significant impact on the applied math community and has
been one of the most downloaded papers since its publication. However, the role 
of the rectified linear unit (ReLU) was not clear in this work, and one of the 
reviewers in a medical imaging journal consistently asked me to explain the role 
of the ReLU in deep neural networks. At first, this looked like a question that went 
beyond the scope of the medical application paper, but I am grateful to the reviewer, 
as during the agony of preparing the answers to the question, I realized that the 
ReLU determines the input space partitioning, which is automatically adapted to 
the input space manifold. In fact, this finding led to a 2019 ICML paper, in which 
we revealed the combinatorial representation of framelets, which clearly shows the 
crucial connection with the classic compressed sensing (CS) approaches. 

Looking back, I was pretty brave to start this book project, as these are just two 
pieces of my geometric understanding of deep learning. However, as I was preparing 
the reading material for each subject of deep learning, I found that there are indeed 
many exciting geometric insights that have not been fully discussed. 

For example, when I wrote the chapter on backpropagation, I recognized the 
importance of the denominator layout convention in the matrix calculus, which 
led to the beautiful geometry of the backpropagation. Before writing this book, 
the normalization and attention mechanisms looked very heuristic to me, with
no evidence of a systematic understanding, and their similarities made them even
more confusing. For example, AdaIN, Transformer, and BERT were like dark
recipes that researchers have developed with their own secret sauces. However, 
an in-depth study for the preparation of the reading material has revealed a very 
nice mathematical structure behind their intuition, which shows a close connection 
between them and their relationship to optimal transport theory. 

Writing a chapter on the geometry of deep neural networks was another joy 
that broadened my insight. During my lecture, one of my students pointed out that 
some partitions can lead to a low-rank mapping. In retrospect, this was already in 
the equation, but it was not until my students challenged me that I recognized the 
beautiful geometry of the partition, which fits perfectly with fascinating empirical 
observations of the deep neural network. 

The last chapter, on generative models and unsupervised learning, is something 
of which I am very proud. In contrast to the conventional explanation of the generative
adversarial network (GAN), variational auto-encoder (VAE), and normalizing
flows with probabilistic tools, my main focus was to derive them with geometric 
tools. In fact, this effort was quite rewarding, and this chapter clearly unified various 
forms of generative model as statistical distance minimization and optimal transport 
problems. 

In fact, the focus of this book is to give students a geometric insight that can 
help them understand deep learning in a unified framework, and I believe that this is 
one of the first deep learning books written from such a perspective. As this book is 
based on the materials that I have prepared for my senior-level undergraduate class, I 
believe that this book can be used for one-semester-long senior-level undergraduate 
and graduate-level classes. In addition, my class was a code-shared course for 
both bioengineering and math students, so much of the content is
interdisciplinary and tries to appeal to students in both disciplines.

I am very grateful to my TAs and students of the 2020 spring class of BiS400C
and MAS480. I would especially like to thank my great team of TAs: Sangjoon 
Park, Yujin Oh, Chanyong Jung, Byeongsu Sim, Hyungjin Chung, and Gyutaek 
Oh. Sangjoon, in particular, has done a tremendous job as Head TA and provided 
organized feedback on the typographical errors and mistakes of this book. I would 
also like to thank my wonderful team at the Bio Imaging, Signal Processing 
and Learning laboratory (BISPL) at KAIST, who have produced ground-breaking 
research works that have inspired me. 

Many thanks to my awesome son and future scientist, Andy Sangwoo, and my 
sweet daughter and future writer, Ella Jiwoo, for their love and support. You are my 
endless source of energy and inspiration, and I am so proud of you. Last but not
least, I would like to thank my beloved wife, Seungjoo (Joo), for her endless love
and constant support ever since we met. I owe you everything and you made me a 
good man. With my warmest thanks, 


Daejeon, Korea Jong Chul Ye 
February, 2021 
Part I
Basic Tools for Machine Learning 


“I heard reiteration of the following claim: Complex theories do not work; simple
algorithms do. I would like to demonstrate that in the area of science a good old 
principle is valid: Nothing is more practical than a good theory.” 


—Vladimir N Vapnik 


Chapter 1
Mathematical Preliminaries


In this chapter, we briefly review the basic mathematical concepts that are required 
to understand the materials of this book. 


1.1 Metric Space 


A metric space (X, d) is a set X together with a metric d on the set. Here, a metric 
is a function that defines a concept of distance between any two members of the set, 
which is formally defined as follows. 


Definition 1.1 (Metric) A metric on a set $X$ is a function called the distance
$d : X \times X \mapsto \mathbb{R}_+$, where $\mathbb{R}_+$ is the set of non-negative real numbers. For all
$x, y, z \in X$, this function is required to satisfy the following conditions:


1. $d(x, y) \ge 0$ (non-negativity).

2. $d(x, y) = 0$ if and only if $x = y$.

3. $d(x, y) = d(y, x)$ (symmetry).

4. $d(x, z) \le d(x, y) + d(y, z)$ (triangle inequality).
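
The four conditions can be checked numerically on a finite sample of points. The following is a minimal sketch, not a proof: the helper names `euclidean` and `satisfies_metric_axioms`, the random sample, and the tolerance are all illustrative assumptions. It also shows that the squared Euclidean distance fails the triangle inequality and is therefore not a metric.

```python
import itertools
import math
import random

def euclidean(x, y):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def satisfies_metric_axioms(d, points, tol=1e-12):
    """Check the four conditions of the metric definition on a finite set of points."""
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, y) < 0:                               # 1. non-negativity
            return False
        if (d(x, y) == 0) != (x == y):                # 2. d(x, y) = 0 iff x = y
            return False
        if abs(d(x, y) - d(y, x)) > tol:              # 3. symmetry
            return False
        if d(x, z) > d(x, y) + d(y, z) + tol:         # 4. triangle inequality
            return False
    return True

random.seed(0)
points = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(8)]
metric_ok = satisfies_metric_axioms(euclidean, points)

# The squared distance violates the triangle inequality on 0, 1, 2
# (4 > 1 + 1), so it is not a metric.
squared = lambda x, y: euclidean(x, y) ** 2
squared_ok = satisfies_metric_axioms(squared, [(0.0,), (1.0,), (2.0,)])
```

A passing check on sampled points only fails to find a counterexample; it does not establish the axioms for the whole space.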


A metric on a space induces topological properties like open and closed sets, which 
lead to the study of more abstract topological spaces. Specifically, about any point 
x in a metric space X, we define the open ball of radius r > 0 about x as the set 


$$B_r(x) = \{y \in X : d(x, y) < r\}. \tag{1.1}$$


Using this, we have the formal definition of openness and closedness of a set. 


Definition 1.2 (Open Set, Closed Set) A subset $U \subset X$ is called open if for every
$x \in U$ there exists an $r > 0$ such that $B_r(x)$ is contained in $U$. The complement of
an open set is called closed.

A sequence $(x_n)$ in a metric space $X$ is said to converge to the limit $x \in X$ if and
only if for every $\epsilon > 0$, there exists a natural number $N$ such that $d(x_n, x) < \epsilon$ for
all $n > N$. A subset $S$ of the metric space $X$ is closed if and only if every sequence
in $S$ that converges to a limit in $X$ has its limit in $S$. In addition, a sequence of
elements $(x_n)$ is a Cauchy sequence if and only if for every $\epsilon > 0$, there is some
$N \ge 1$ such that


$$d(x_n, x_m) < \epsilon, \quad \forall m, n \ge N.$$


We are now ready to define the important concepts in metric spaces. 


Definition 1.3 (Completeness) A metric space $X$ is said to be complete if every
Cauchy sequence converges to a limit in $X$; that is, if $d(x_n, x_m) \to 0$ as both $n$ and $m$
independently go to infinity, then there is some $y \in X$ with $d(x_n, y) \to 0$.


Definition 1.4 (Lipschitz Continuity) Given two metric spaces $(X, d_X)$ and
$(Y, d_Y)$, where $d_X$ denotes the metric on the set $X$ and $d_Y$ is the metric on the set
$Y$, a function $f : X \mapsto Y$ is called Lipschitz continuous if there exists a real
constant $K \ge 0$ such that, for all $x_1, x_2 \in X$,


$$d_Y(f(x_1), f(x_2)) \le K d_X(x_1, x_2). \tag{1.2}$$


Here, the constant $K$ is often called the Lipschitz constant, and a function $f$ with
the Lipschitz constant $K$ is called a $K$-Lipschitz function.
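
To make the definition concrete, the ratio $d_Y(f(x_1), f(x_2)) / d_X(x_1, x_2)$ can be sampled for scalar functions with the metric $|a - b|$. This is only an illustrative sketch; the function name `lipschitz_ratio`, the sampling range, and the two test functions are assumptions chosen for the example.

```python
import math
import random

def lipschitz_ratio(f, x1, x2):
    """Return d_Y(f(x1), f(x2)) / d_X(x1, x2) for the metric d(a, b) = |a - b|."""
    return abs(f(x1) - f(x2)) / abs(x1 - x2)

random.seed(0)
pairs = [(random.uniform(-10, 10), random.uniform(-10, 10)) for _ in range(10_000)]

# sin is 1-Lipschitz because |sin'(x)| = |cos(x)| <= 1.
worst_sin = max(lipschitz_ratio(math.sin, a, b) for a, b in pairs if a != b)

# x -> x**2 is not Lipschitz on the whole real line: the ratio equals |x1 + x2|,
# which grows without bound as the points move away from the origin.
ratio_small = lipschitz_ratio(lambda x: x * x, 1.0, 2.0)
ratio_large = lipschitz_ratio(lambda x: x * x, 1000.0, 1001.0)
```

The sampled maximum for the sine never exceeds 1, while no single constant $K$ can bound the ratio for the square function.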


1.2 Vector Space 


A vector space $V$ is a set that is closed under finite vector addition and scalar
multiplication. In machine learning applications, the scalars are usually real or
complex numbers, in which case $V$ is called a vector space over the real numbers
or the complex numbers.

For example, the Euclidean $n$-space $\mathbb{R}^n$ is called a real vector space, and $\mathbb{C}^n$
is called a complex vector space. In the $n$-dimensional Euclidean space $\mathbb{R}^n$, every
element is represented by a list of $n$ real numbers, addition is component-wise, and
scalar multiplication is multiplication on each term separately. More specifically, we
define a column $n$-real-valued vector $x$ to be an array of $n$ real numbers, denoted by


$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}^\top,$$


where the superscript $\top$ denotes the adjoint. Note that for a real vector, the adjoint
is just a transpose. Then, the sum of the two vectors $x$ and $y$, denoted by $x + y$, is
defined by


$$x + y = \begin{bmatrix} x_1 + y_1 & x_2 + y_2 & \cdots & x_n + y_n \end{bmatrix}^\top.$$


Similarly, the scalar multiplication with a scalar $\alpha \in \mathbb{R}$ is defined by


$$\alpha x = \begin{bmatrix} \alpha x_1 & \alpha x_2 & \cdots & \alpha x_n \end{bmatrix}^\top.$$


In addition, we formally define the inner product and the norm in a vector space 
as follows. 


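Definition 1.5 (Inner Product) Let $V$ be a vector space over $\mathbb{R}$. A function
$\langle \cdot, \cdot \rangle_V : V \times V \mapsto \mathbb{R}$ is an inner product on $V$ if:


1. Linear: $\langle \alpha_1 f_1 + \alpha_2 f_2, g \rangle_V = \alpha_1 \langle f_1, g \rangle_V + \alpha_2 \langle f_2, g \rangle_V$ for all $\alpha_1, \alpha_2 \in \mathbb{R}$ and
$f_1, f_2, g \in V$.

2. Symmetric: $\langle f, g \rangle_V = \langle g, f \rangle_V$.

3. $\langle f, f \rangle_V \ge 0$, and $\langle f, f \rangle_V = 0$ if and only if $f = 0$.


If the underlying vector space $V$ is obvious, we usually represent the inner product
without the subscript $V$, i.e. $\langle f, g \rangle$. For example, the inner product of the two
vectors $f, g \in \mathbb{R}^n$ is defined as


$$\langle f, g \rangle = \sum_{i=1}^{n} f_i g_i = f^\top g.$$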


Two nonzero vectors $x, y$ are called orthogonal when


$$\langle x, y \rangle = 0,$$


which we denote as $x \perp y$. A vector $x$ is orthogonal to a subset $S \subset V$, denoted by
$x \perp S$, if it is orthogonal to every element of $S$. The orthogonal complement of $S$,
denoted by $S^\perp$, consists of all vectors in $V$ that are orthogonal to every vector in $S$,
i.e.


$$S^\perp = \{x \in V : \langle v, x \rangle = 0, \ \forall v \in S\}.$$


Definition 1.6 (Norm) A norm $\| \cdot \|$ is a real-valued function defined on the vector
space that has the following properties:


1. $\|x\| \ge 0$, and $\|x\| = 0$ if and only if $x = 0$.
2. $\|\alpha x\| = |\alpha| \|x\|$ for any scalar $\alpha$.
3. Triangle inequality: $\|x + y\| \le \|x\| + \|y\|$ for any vectors $x$ and $y$.
From the inner product, we can obtain the so-called induced norm:


$$\|x\| = \sqrt{\langle x, x \rangle}.$$


Similarly, the definition of the metric in Sect. 1.1 informs us that a norm in a vector
space $V$ induces a metric, i.e.


$$d(x, y) = \|x - y\|, \quad x, y \in V. \tag{1.3}$$


The norm and inner product in a vector space have special relations. For example,
for any two vectors $x, y \in V$, the following Cauchy-Schwarz inequality always
holds:


$$|\langle x, y \rangle| \le \|x\| \|y\|. \tag{1.4}$$
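
The chain from inner product to induced norm to induced metric can be traced numerically in $\mathbb{R}^5$. The sketch below assumes NumPy and uses randomly drawn vectors; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)

inner = x @ y                          # <x, y> = x^T y
norm_x = np.sqrt(x @ x)                # induced norm ||x|| = sqrt(<x, x>)
norm_y = np.sqrt(y @ y)
dist = np.sqrt((x - y) @ (x - y))      # induced metric d(x, y) = ||x - y||

# Cauchy-Schwarz inequality and the triangle inequality of the induced norm.
cauchy_schwarz_holds = abs(inner) <= norm_x * norm_y
triangle_holds = np.sqrt((x + y) @ (x + y)) <= norm_x + norm_y

# Equality in Cauchy-Schwarz occurs for parallel vectors, e.g. y = 2x.
gap_parallel = abs(x @ (2 * x)) - norm_x * np.sqrt((2 * x) @ (2 * x))
```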


1.3 Banach and Hilbert Space


An inner product space is defined as a vector space that is equipped with an inner
product. A normed space is a vector space on which a norm is defined. An inner
product space is always a normed space since we can define a norm as $\|f\| =
\sqrt{\langle f, f \rangle}$, which is often called the induced norm. Among the various forms of the
normed space, one of the most useful normed spaces is the Banach space. 


Definition 1.7 The Banach space is a complete normed space. 


Here, the “completeness” is especially important from the optimization perspective, 
since most optimization algorithms are implemented in an iterative manner so that 
the final solution of the iterative method should belong to the underlying space H. 
Recall that the convergence property is a property of a metric space. Therefore, the 
Banach space can be regarded as a vector space equipped with desirable properties 
of a metric space. Similarly, we can define the Hilbert space. 


Definition 1.8 The Hilbert space is a complete inner product space. 


We can easily see that the Hilbert space is also a Banach space thanks to the 
induced norm. The inclusion relationship between vector spaces, normed spaces, 
inner product spaces, Banach spaces and Hilbert spaces is illustrated in Fig. 1.1. 

As shown in Fig. 1.1, the Hilbert space has many nice mathematical structures 
such as inner product, norm, completeness, etc., so it is widely used in the machine 
learning literature. The following are well-known examples of Hilbert spaces: 


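* $\ell^2(\mathbb{Z})$: a function space composed of square summable discrete-time signals, i.e.


$$\ell^2(\mathbb{Z}) = \left\{ x = \{x_l\}_{l=-\infty}^{\infty} \ \middle|\ \sum_{l=-\infty}^{\infty} |x_l|^2 < \infty \right\}.$$


Here, the inner product is defined as


$$\langle x, y \rangle = \sum_{l=-\infty}^{\infty} x_l y_l, \quad \forall x, y \in H. \tag{1.5}$$


* $L^2(\mathbb{R})$: a function space composed of square integrable continuous-time signals,
i.e.


$$L^2(\mathbb{R}) = \left\{ x(t) \ \middle|\ \int_{-\infty}^{\infty} |x(t)|^2 \, dt < \infty \right\}.$$


Here, the inner product is defined as


$$\langle x, y \rangle = \int_{-\infty}^{\infty} x(t) y(t) \, dt. \tag{1.6}$$


[Figure: diagram with regions labeled Reproducing Kernel Hilbert Space, Inner Product Space, Normed Space, and Vector Space]

Fig. 1.1 RKHS, Hilbert space, Banach space, and vector space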


Among the various forms of the Hilbert space, the reproducing kernel Hilbert space 
(RKHS) is of particular interest in the classical machine learning literature, which 
will be explained later in this book. Here, the readers are reminded that the RKHS is 
only a subset of the Hilbert space as shown in Fig. 1.1, i.e. the Hilbert space is more 
general than the RKHS. 


1.3.1 Basis and Frames 


The set of vectors $\{x_1, \cdots, x_k\}$ is said to be linearly independent if a linear
combination denoted by


$$\alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_k x_k = 0$$


implies that


$$\alpha_i = 0, \quad i = 1, \cdots, k.$$


The set of all vectors reachable by taking linear combinations of vectors in a set $S$
is called the span of $S$. For example, if $S = \{x_i\}_{i=1}^{k}$, then we have


$$\mathrm{span}(S) = \left\{ \sum_{i=1}^{k} \alpha_i x_i, \ \forall \alpha_i \in \mathbb{R} \right\}.$$


A set $B = \{b_i\}_{i=1}^{m}$ of elements (vectors) in a vector space $V$ is called a basis,
if every element of $V$ may be written in a unique way as a linear combination of
elements of $B$, that is, for all $f \in V$, there exist unique coefficients $\{c_i\}$ such that


$$f = \sum_{i=1}^{m} c_i b_i. \tag{1.7}$$


A set $B$ is a basis of $V$ if and only if the elements of $B$ are linearly independent
and $\mathrm{span}(B) = V$. The coefficients of this linear combination are referred to as
expansion coefficients, or coordinates on $B$ of the vector. The elements of a basis
are called basis vectors. In general, for $m$-dimensional spaces, the number of basis
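vectors is $m$. For example, when $V = \mathbb{R}^2$, the following two sets are some examples
of a basis:


$$\left\{ \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \end{bmatrix} \right\}, \quad \left\{ \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 \\ -1 \end{bmatrix} \right\}. \tag{1.8}$$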


For function spaces, the number of basis vectors can be infinite. For example, for
the space $V_T$ composed of periodic functions with the period of $T$, the following
complex sinusoids constitute its basis:


$$B = \{\phi_n(t)\}_{n=-\infty}^{\infty}, \quad \phi_n(t) = e^{j \frac{2\pi n t}{T}}, \tag{1.9}$$


so that any function $x(t) \in V_T$ can be represented by


$$x(t) = \sum_{n=-\infty}^{\infty} a_n \phi_n(t), \tag{1.10}$$


where the expansion coefficient is given by


$$a_n = \frac{1}{T} \int_T x(t) \phi_n^*(t) \, dt. \tag{1.11}$$

In fact, this basis expansion is often called the Fourier series. 
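
As a sanity check of the Fourier series expansion, the coefficient integral can be approximated by a Riemann sum over one period. The sketch below assumes NumPy; the period `T`, the sample count `N`, and the test signal are arbitrary illustrative choices, and `phi` and `coefficient` are hypothetical helper names.

```python
import numpy as np

T = 2.0                                   # period (arbitrary choice)
N = 1024                                  # samples over one period (arbitrary choice)
t = np.arange(N) * T / N

def phi(n, t):
    """Complex sinusoid phi_n(t) = exp(j 2 pi n t / T)."""
    return np.exp(2j * np.pi * n * t / T)

def coefficient(x, n):
    """Riemann-sum approximation of a_n = (1/T) * integral of x(t) conj(phi_n(t)) dt."""
    return np.mean(x * np.conj(phi(n, t)))

# Test signal: 1 + cos(2 pi t / T) = phi_0 + 0.5 phi_1 + 0.5 phi_{-1}.
x = 1.0 + np.cos(2 * np.pi * t / T)
a = {n: coefficient(x, n) for n in range(-3, 4)}

# Rebuild x(t) from its expansion coefficients.
x_rebuilt = sum(a[n] * phi(n, t) for n in a)
```

Only the coefficients for $n = -1, 0, 1$ are nonzero, matching the decomposition of the cosine into two complex sinusoids.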
Unlike the basis, which leads to the unique expansion, the frame is composed
of redundant basis vectors, which allows multiple representations. For example,
consider the following frame in $\mathbb{R}^2$:


$$\{v_1, v_2, v_3\} = \left\{ \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 \\ 1 \end{bmatrix} \right\}. \tag{1.12}$$


Then, we can easily see that the frame allows multiple representations of, for
example, $x = [2, 3]^\top$ as shown in the following:


$$x = 2 v_1 + 3 v_2 = v_2 + 2 v_3. \tag{1.13}$$


Frames can also be extended to deal with function spaces, in which case the number
of frame elements is infinite.
Formally, a set of functions


$$\Phi = [\phi_k]_{k \in \Gamma} = \begin{bmatrix} \cdots & \phi_{k-1} & \phi_k & \cdots \end{bmatrix}$$


in a Hilbert space $H$ is called a frame if it satisfies the following inequality [1]:


$$\alpha \|f\|^2 \le \sum_{k \in \Gamma} |\langle f, \phi_k \rangle|^2 \le \beta \|f\|^2, \quad \forall f \in H, \tag{1.14}$$


where $\alpha, \beta > 0$ are called the frame bounds. If $\alpha = \beta$, then the frame is said to be
tight. In fact, an orthonormal basis is a special case of a tight frame, with $\alpha = \beta = 1$.
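
For a finite frame in $\mathbb{R}^2$, the frame bounds can be computed directly. The sketch below assumes NumPy and assumes the three frame vectors $v_1 = [1, 0]^\top$, $v_2 = [0, 1]^\top$, $v_3 = [1, 1]^\top$, which are consistent with the two representations of $x = [2, 3]^\top$ in (1.13). Because $\sum_k |\langle f, v_k \rangle|^2 = f^\top (V^\top V) f$ when the rows of $V$ are the frame vectors, the tightest bounds are the smallest and largest eigenvalues of $V^\top V$.

```python
import numpy as np

# Rows are the frame vectors; assumed values, chosen to be consistent with Eq. (1.13).
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
x = np.array([2.0, 3.0])

# Two different coefficient vectors reproduce the same x.
rep_a = np.array([2.0, 3.0, 0.0]) @ V      # 2 v1 + 3 v2
rep_b = np.array([0.0, 1.0, 2.0]) @ V      # v2 + 2 v3

# The tightest frame bounds are the extreme eigenvalues of the frame operator V^T V.
frame_operator = V.T @ V
alpha, beta = np.linalg.eigvalsh(frame_operator)[[0, -1]]
is_tight = np.isclose(alpha, beta)
```

For this example the bounds differ, so the frame is redundant but not tight.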


1.4 Probability Space 


We now start with a formal definition of a probability space and related terms from 
the measure theory [2]. 


Definition 1.9 (Probability Space) A probability space is a triple $(\Omega, \mathcal{F}, \mu)$ consisting
of the sample space $\Omega$, an event space $\mathcal{F}$ composed of subsets of $\Omega$
(which is often called a $\sigma$-algebra), and the probability measure (or distribution)
$\mu : \mathcal{F} \mapsto [0, 1]$, a function such that:


* $\mu$ must satisfy the countable additivity property that for all countable collections
$\{E_i\}$ of pairwise disjoint sets:


$$\mu\left(\cup_i E_i\right) = \sum_i \mu(E_i);$$


* the measure of the entire sample space is equal to one: $\mu(\Omega) = 1$.
In fact, the probability measure is a special case of the general “measure” in
measure theory [2]. Specifically, the general term “measure” is defined similarly to
the probability measure defined above except that only positivity and the countable
additivity property are required. Another important special case of a measure is the
counting measure $\nu(A)$, which is the measure that assigns its value as the number
of elements in the set $A$.

To understand the concept of a probability space, we give two examples: one for 
the discrete case, the other for the continuous one. 


Example (Discrete Probability Space)

If the experiment consists of just one flip of a fair coin, then the outcome is
either heads or tails: $\{H, T\}$. Hence, the sample space is $\Omega = \{H, T\}$. The $\sigma$-algebra
or the event space contains $2^2 = 4$ events, namely: $\{H\}$ (“heads”),
$\{T\}$ (“tails”), $\emptyset$ (“neither heads nor tails”), and $\{H, T\}$ (“either heads or
tails”); in other words, $\mathcal{F} = \{\emptyset, \{H\}, \{T\}, \{H, T\}\}$. There is a 50% chance of
tossing heads and 50% for tails, so the probability measure in this example is
$P(\emptyset) = 0$, $P(\{H\}) = 0.5$, $P(\{T\}) = 0.5$, $P(\{H, T\}) = 1$.
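
The coin-flip example is small enough to enumerate in full. The following sketch builds the event space as the power set of the sample space and checks additivity and normalisation; the helper names `power_set` and `prob` are illustrative assumptions, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import chain, combinations

omega = frozenset({"H", "T"})                      # sample space

def power_set(s):
    """All subsets of s; for a finite sample space this is a valid sigma-algebra."""
    items = sorted(s)
    subsets = chain.from_iterable(combinations(items, k) for k in range(len(items) + 1))
    return {frozenset(c) for c in subsets}

events = power_set(omega)                          # event space with 2^2 = 4 events
p_outcome = {"H": Fraction(1, 2), "T": Fraction(1, 2)}

def prob(event):
    """Probability measure: sum of the outcome probabilities in the event."""
    return sum((p_outcome[o] for o in event), Fraction(0))

# Additivity on the disjoint events {H} and {T}.
heads, tails = frozenset("H"), frozenset("T")
additive = prob(heads | tails) == prob(heads) + prob(tails)
```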


Example (Continuous Probability Space)

A number between 0 and 1 is chosen at random, uniformly. Here $\Omega = [0, 1]$.
In this case, the event space $\mathcal{F}$ can be generated by: (i) the open intervals $(a, b)$
on $[0, 1]$; (ii) the closed intervals $[a, b]$; (iii) the closed half-lines $[0, a]$, and
their union, intersection, complement, and so on. Finally, the measure $\mu$ is the
Lebesgue measure, defined as the sum of the lengths of the intervals contained
in $\mathcal{F}$, e.g. $\mu([0.2, 0.5]) = 0.3$, $\mu([0, 0.2) \cup [0.5, 0.8]) = 0.5$, $\mu(\{0.5\}) = 0$.


We now define the Radon-Nikodym derivative, which is a mathematical tool
to derive the probability density function (pdf) for the continuous domain, or
probability mass function (pmf) for the discrete domain in a rigorous setting. This
is also important in deriving the statistical distances, in particular, the divergences.
For this, we need to understand the concept of an absolutely continuous measure.


Definition 1.10 (Absolutely Continuous Measure) If $\mu$ and $\nu$ are two measures
on any event set $\mathcal{F}$ of $\Omega$, we say that $\nu$ is absolutely continuous with respect to $\mu$, or
$\nu \ll \mu$, if for every measurable set $A$, $\mu(A) = 0$ implies $\nu(A) = 0$.


Theorem 1.1 (Radon-Nikodym Theorem) Let $\lambda$ and $\nu$ be two measures on any
event set $\mathcal{F}$ of $\Omega$. If $\lambda \ll \nu$, then there exists a non-negative function $g$ on $\Omega$ such
that


$$\lambda(A) = \int_A d\lambda = \int_A g \, d\nu, \quad A \in \mathcal{F}. \tag{1.15}$$
The function $g$ is called the Radon-Nikodym derivative or density of $\lambda$ w.r.t. $\nu$ and
is denoted by $d\lambda / d\nu$. One of the popular Radon-Nikodym derivatives in probability
theory is the probability density function (pdf) or probability mass function (pmf)
as discussed below.

For a probability space $(\Omega, \mathcal{F}, \mu)$, a random variable is defined as a function
$X : \Omega \mapsto M$ from a set of possible outcomes $\Omega$ to a measurable space $M$. For the
random variable $X$, we can now define the mean for its functions:


$$\mathbb{E}_\mu[g(X)] = \int g(x) \, d\mu(x). \tag{1.16}$$


1.5 Some Matrix Algebra 


In the following, we introduce some matrix algebra that is useful in understanding 
the materials in this book. 

A matrix is a rectangular array of numbers, denoted by an upper case letter, say
$A$. A matrix with $m$ rows and $n$ columns is called an $m \times n$ matrix given by


$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}.$$


The $k$-th column of matrix $A$ is often denoted by $a_k$. The maximal number of
linearly independent columns of $A$ is called the rank of the matrix $A$. It is easy
to show that


$$\mathrm{Rank}(A) = \dim \mathrm{span}([a_1, \cdots, a_n]).$$


The trace of a square matrix $A \in \mathbb{R}^{n \times n}$, denoted $\mathrm{Tr}(A)$, is defined to be the sum of
elements on the main diagonal (from the upper left to the lower right) of $A$:


$$\mathrm{Tr}(A) = \sum_{i=1}^{n} a_{ii}.$$


Definition 1.11 (Range Space) The range space of a matrix $A \in \mathbb{R}^{m \times n}$, denoted
by $\mathcal{R}(A)$, is defined by $\mathcal{R}(A) := \{Ax \mid \forall x \in \mathbb{R}^n\}$.


Definition 1.12 (Null Space) The null space of a matrix $A \in \mathbb{R}^{m \times n}$, denoted by
$\mathcal{N}(A)$, is defined by $\mathcal{N}(A) := \{x \in \mathbb{R}^n \mid Ax = 0\}$.
A subset of a vector space is called a subspace if it is closed under both addition and
scalar multiplication. We can easily see that the range and null spaces are subspaces.
Moreover, we can show the following fundamental property:


$$\mathcal{R}(A)^\perp = \mathcal{N}(A^\top), \quad \mathcal{N}(A)^\perp = \mathcal{R}(A^\top). \tag{1.17}$$


If a vector space $V$ is a Hilbert space, then it is known that for a subspace $S \subset V$
and a vector $y \in V$, the point in $S$ that is closest to $y$ exists and is unique, and is
given by


$$\hat{y} = P_S y,$$


where $P_S$ is the projector associated with the subspace $S$. In particular, if the
subspace $S$ has a basis whose vectors form the columns of a matrix $B$, then the
projector for $S$ is given by


$$P_S = B (B^\top B)^{-1} B^\top.$$
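
The projector formula is easy to test numerically. The following sketch assumes NumPy and a randomly generated basis matrix `B` whose two columns span a subspace of $\mathbb{R}^5$; the sizes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 2))            # columns form a basis of a 2-D subspace S of R^5
y = rng.standard_normal(5)

P = B @ np.linalg.inv(B.T @ B) @ B.T       # P_S = B (B^T B)^{-1} B^T
y_hat = P @ y                              # closest point to y in S

residual = y - y_hat
# The residual is orthogonal to S, and projecting twice changes nothing.
orthogonal = np.allclose(B.T @ residual, 0.0)
idempotent = np.allclose(P @ P, P)

# Any other point B c in S is at least as far from y as y_hat is.
other = B @ rng.standard_normal(2)
closer_or_equal = np.linalg.norm(residual) <= np.linalg.norm(y - other)
```

In practice one would solve a least-squares problem rather than form the inverse explicitly; the explicit formula is kept here only to mirror the expression in the text.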


The eigen-decomposition of a square matrix is defined as follows.
Definition 1.13 (Eigen-Decomposition) A (nonzero) vector $v \in \mathbb{C}^n$ is an eigenvector
of a square matrix $A \in \mathbb{C}^{n \times n}$ if it satisfies the linear equation


$$Av = \lambda v, \tag{1.18}$$


where $\lambda$ is a scalar, termed the eigenvalue corresponding to $v$.
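We now define the singular value decomposition (SVD) of $A$.


Theorem 1.2 (SVD Theorem) If $A \in \mathbb{C}^{m \times n}$ is a rank $r$ matrix, then there exist
matrices $U \in \mathbb{C}^{m \times r}$ and $V \in \mathbb{C}^{n \times r}$ such that $U^\top U = V^\top V = I_r$, and $A =
U \Sigma V^\top$, where $I_r$ is the $r \times r$ identity matrix and $\Sigma$ is an $r \times r$ diagonal matrix
whose diagonal entries, called singular values, satisfy


$$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0.$$


The decomposition can be written as


$$A = \begin{bmatrix} u_1 & \cdots & u_r \end{bmatrix} \begin{bmatrix} \sigma_1 & 0 & \cdots & 0 \\ 0 & \sigma_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_r \end{bmatrix} \begin{bmatrix} v_1 & \cdots & v_r \end{bmatrix}^\top = \sum_{k=1}^{r} \sigma_k u_k v_k^\top,$$


where $u_k$ and $v_k$ are called left singular vectors and right singular vectors,
respectively.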
Using the SVD, we can easily show the following:


$$P_{\mathcal{R}(A)} = U U^\top, \quad P_{\mathcal{R}(A^\top)} = V V^\top. \tag{1.19}$$


Using the SVD, we can define the matrix norm. Among the various forms of matrix
norms for a matrix $X \in \mathbb{R}^{n \times n}$, the spectral norm $\|X\|_2$ and the nuclear norm $\|X\|_*$
are quite often used, which are defined by


$$\|X\|_2 = \sigma_{\max}(X) = \left(\lambda_{\max}(X^\top X)\right)^{1/2}, \tag{1.20}$$


$$\|X\|_* = \sum_i \sigma_i(X) = \sum_i \left(\lambda_i(X^\top X)\right)^{1/2}, \tag{1.21}$$


where $\sigma_{\max}(\cdot)$ and $\lambda_{\max}(\cdot)$ denote the largest singular value and eigenvalue,
respectively.
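
The compact SVD, the two projectors, and the two matrix norms can all be read off from a single call to a numerical SVD routine. The sketch below assumes NumPy and a randomly generated real $6 \times 4$ matrix of rank 2; the rank threshold `1e-10` is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))   # 6 x 4 matrix of rank 2

U_full, s_full, Vt_full = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s_full > 1e-10 * s_full[0]))          # numerical rank
U, s, V = U_full[:, :r], s_full[:r], Vt_full[:r].T   # compact SVD: A = U diag(s) V^T

reconstructed = U @ np.diag(s) @ V.T
P_range = U @ U.T                               # projector onto the range space of A
P_row = V @ V.T                                 # projector onto the range space of A^T

spectral = s[0]                                 # spectral norm: largest singular value
nuclear = s.sum()                               # nuclear norm: sum of singular values
eigs = np.linalg.eigvalsh(A.T @ A)[::-1][:r]    # r largest eigenvalues of A^T A
```

The square roots of the nonzero eigenvalues of $A^\top A$ coincide with the singular values, which is the relation used in the definitions of both norms.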
The following matrix inversion lemma [3] is quite useful. 


Lemma 1.1 (Matrix Inversion Lemma)


$$(I + UCV)^{-1} = I - U \left(C^{-1} + VU\right)^{-1} V. \tag{1.22}$$


$$(A + UCV)^{-1} = A^{-1} - A^{-1} U \left(C^{-1} + V A^{-1} U\right)^{-1} V A^{-1}. \tag{1.23}$$
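
Both identities can be verified numerically. The sketch below assumes NumPy; the sizes `n` and `k`, the diagonal choices for `A` and `C`, and the choice `V = U.T` (which keeps every matrix being inverted positive definite) are assumptions made only so that the example is well conditioned.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 2
A = np.diag(rng.uniform(1.0, 2.0, n))                 # easy-to-invert n x n matrix
U = rng.standard_normal((n, k))
C = np.diag(rng.uniform(1.0, 2.0, k))                 # invertible k x k matrix
V = U.T                                               # makes A + U C V positive definite

inv = np.linalg.inv
A_inv = inv(A)
lhs = inv(A + U @ C @ V)                              # direct n x n inverse
rhs = A_inv - A_inv @ U @ inv(inv(C) + V @ A_inv @ U) @ V @ A_inv

# The first identity is the special case A = I.
I = np.eye(n)
lhs_identity = inv(I + U @ C @ V)
rhs_identity = I - U @ inv(inv(C) + V @ U) @ V
```

The practical appeal is that the right-hand side only inverts a $k \times k$ matrix once $A^{-1}$ is known, which is cheap when $k$ is much smaller than $n$.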


1.5.1 Kronecker Product 


In mathematics, the Kronecker product, sometimes denoted by $\otimes$, is an operation
on two matrices of arbitrary size resulting in a block matrix. The formal definition
is given as follows.


Definition 1.14 (Kronecker Product) If $A$ is an $m \times n$ matrix and $B$ is a $p \times q$
matrix, then the Kronecker product $A \otimes B$ is the $pm \times qn$ block matrix:


$$A \otimes B = \begin{bmatrix} a_{11} B & \cdots & a_{1n} B \\ \vdots & \ddots & \vdots \\ a_{m1} B & \cdots & a_{mn} B \end{bmatrix}. \tag{1.24}$$


The Kronecker product has many important properties, which can be exploited to
simplify many matrix-related operations. Some of the basic properties are provided
in the following lemma. The proofs of the lemmas are straightforward and can
easily be found in a standard linear algebra textbook [4].

Lemma 1.2


$$A \otimes (B + C) = A \otimes B + A \otimes C. \tag{1.25}$$
$$(B + C) \otimes A = B \otimes A + C \otimes A. \tag{1.26}$$
$$A \otimes B \ne B \otimes A \ \text{(in general)}. \tag{1.27}$$
$$(A \otimes B) \otimes C = A \otimes (B \otimes C). \tag{1.28}$$
$$(A \otimes B)^\top = A^\top \otimes B^\top. \tag{1.29}$$
$$(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}. \tag{1.30}$$


Lemma 1.3 If $A$, $B$, $C$ and $D$ are matrices of such a size that one can form the
matrix products $AC$ and $BD$, then


$$(A \otimes B)(C \otimes D) = AC \otimes BD. \tag{1.31}$$
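
The block structure and the mixed-product property can be checked with NumPy's `np.kron`. The matrix sizes below are arbitrary illustrative choices, picked so that the products $AC$ and $BD$ exist.

```python
import numpy as np

rng = np.random.default_rng(0)
A, C = rng.standard_normal((2, 3)), rng.standard_normal((3, 4))
B, D = rng.standard_normal((4, 2)), rng.standard_normal((2, 5))

K = np.kron(A, B)                     # block matrix [a_ij B], shape (2*4, 3*2)
top_left_block = K[:4, :2]            # should equal a_11 B

mixed_lhs = np.kron(A, B) @ np.kron(C, D)     # (A kron B)(C kron D)
mixed_rhs = np.kron(A @ C, B @ D)             # AC kron BD

transpose_ok = np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))
commutes = np.allclose(np.kron(A, B), np.kron(B, A))   # generally False
```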


One of the important usages of the Kronecker product comes from the vectorization 
operation of a matrix. For this we first define the following two operations. 


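Definition 1.15 If $A = \begin{bmatrix} a_1 & \cdots & a_n \end{bmatrix} \in \mathbb{R}^{m \times n}$, then


$$\mathrm{VEC}(A) = \begin{bmatrix} a_1 \\ \vdots \\ a_n \end{bmatrix} \in \mathbb{R}^{mn}, \tag{1.32}$$


$$\mathrm{UNVEC}(\mathrm{VEC}(A)) = \mathrm{UNVEC}\left( \begin{bmatrix} a_1 \\ \vdots \\ a_n \end{bmatrix} \right) = A. \tag{1.33}$$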


From these definitions, we can obtain the following two lemmas which will be 
extensively used here. 


Lemma 1.4 ([4]) For the matrices $A$, $B$, $C$ with appropriate sizes, we have


$$\mathrm{VEC}(CAB) = (B^\top \otimes C) \mathrm{VEC}(A), \tag{1.34}$$


where $\mathrm{VEC}(\cdot)$ is the column-wise vectorization operation.


Lemma 1.5 For the vectors $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$, we have


$$\mathrm{VEC}(x y^\top) = (y \otimes I_m) x, \tag{1.35}$$


where $I_m$ denotes the $m \times m$ identity matrix.
Proof By plugging $C = I_m$, $A = x$ and $B = y^\top$ into (1.34), we conclude the
proof. □
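
Both vectorization lemmas can be confirmed numerically. The sketch below assumes NumPy; `vec` and `unvec` are hypothetical helper names that implement column-wise stacking through Fortran-order reshaping, and the matrix sizes are arbitrary.

```python
import numpy as np

def vec(M):
    """Column-wise vectorization: stack the columns of M into one long vector."""
    return M.reshape(-1, order="F")

def unvec(v, shape):
    """Inverse of vec for a matrix of the given shape."""
    return v.reshape(shape, order="F")

rng = np.random.default_rng(0)
C = rng.standard_normal((3, 4))
A = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2))

# VEC(CAB) = (B^T kron C) VEC(A)
lemma_1_4 = np.allclose(vec(C @ A @ B), np.kron(B.T, C) @ vec(A))

# VEC(x y^T) = (y kron I_m) x, with y treated as an n x 1 column
m, n = 3, 4
x, y = rng.standard_normal(m), rng.standard_normal(n)
y_kron_I = np.kron(y.reshape(-1, 1), np.eye(m))      # shape (n*m, m)
lemma_1_5 = np.allclose(vec(np.outer(x, y)), y_kron_I @ x)

round_trip = np.allclose(unvec(vec(A), A.shape), A)
```

Note that NumPy reshapes row by row unless `order="F"` is given, so omitting it would silently produce the row-wise vectorization, for which the identity takes a different form.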


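1.5.2 Matrix and Vector Calculus


In computing a derivative of a scalar, vector, or matrix with respect to a scalar,
vector, or matrix, we should be consistent with the notation. In fact, there are two
different conventions: numerator layout and denominator layout. For example, for a
given scalar $y$ and a column vector $x = [x_1, \cdots, x_n]^\top \in \mathbb{R}^n$, the numerator layout
has the following convention:


$$\frac{\partial y}{\partial x} = \begin{bmatrix} \frac{\partial y}{\partial x_1} & \cdots & \frac{\partial y}{\partial x_n} \end{bmatrix}, \quad \frac{\partial x}{\partial y} = \begin{bmatrix} \frac{\partial x_1}{\partial y} \\ \vdots \\ \frac{\partial x_n}{\partial y} \end{bmatrix},$$


implying that the number of rows follows that of the numerator. On the other
hand, the denominator layout notation provides


$$\frac{\partial y}{\partial x} = \begin{bmatrix} \frac{\partial y}{\partial x_1} \\ \vdots \\ \frac{\partial y}{\partial x_n} \end{bmatrix}, \quad \frac{\partial x}{\partial y} = \begin{bmatrix} \frac{\partial x_1}{\partial y} & \cdots & \frac{\partial x_n}{\partial y} \end{bmatrix},$$


where the number of resulting rows follows that of the denominator. Either layout
convention is okay, but we should be consistent in using the convention.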

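Here, we will follow the denominator layout convention. The main motivation
for using the denominator layout is from the derivative with respect to the matrix.
More specifically, for a given scalar $c$ and a matrix $W \in \mathbb{R}^{m \times n}$, according to the
denominator layout, we have


$$\frac{\partial c}{\partial W} = \begin{bmatrix} \frac{\partial c}{\partial w_{11}} & \cdots & \frac{\partial c}{\partial w_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial c}{\partial w_{m1}} & \cdots & \frac{\partial c}{\partial w_{mn}} \end{bmatrix} \in \mathbb{R}^{m \times n}. \tag{1.36}$$


Furthermore, this notation leads to the following familiar result:


$$\frac{\partial a^\top x}{\partial x} = \frac{\partial x^\top a}{\partial x} = a. \tag{1.37}$$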
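Accordingly, for a given scalar $c$ and a matrix $W \in \mathbb{R}^{m \times n}$, we can show that


$$\frac{\partial c}{\partial W} = \mathrm{UNVEC}\left( \frac{\partial c}{\partial \mathrm{VEC}(W)} \right) \in \mathbb{R}^{m \times n}, \tag{1.38}$$


in order to be consistent with (1.36). Under the denominator layout notation, for
given vectors $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$, the derivative of a vector with respect to a vector
is given by


$$\frac{\partial y}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_1}{\partial x_m} & \cdots & \frac{\partial y_n}{\partial x_m} \end{bmatrix} \in \mathbb{R}^{m \times n}. \tag{1.39}$$


Then, the chain rule can be specified as follows:


$$\frac{\partial c(g(u))}{\partial x} = \frac{\partial u}{\partial x} \frac{\partial g(u)}{\partial u} \frac{\partial c(g)}{\partial g}. \tag{1.40}$$


Eq. (1.37) also leads to


$$\frac{\partial A x}{\partial x} = A^\top. \tag{1.41}$$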
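Finally, the following result is useful.
Lemma 1.6 Let $A \in \mathbb{R}^{m \times n}$ and $x \in \mathbb{R}^n$. Then, we have


$$\frac{\partial A x}{\partial \mathrm{VEC}(A)} = x \otimes I_m. \tag{1.42}$$


Proof Using Lemma 1.4, we have $Ax = \mathrm{VEC}(Ax) = (x^\top \otimes I_m) \mathrm{VEC}(A)$. Thus,


$$\frac{\partial A x}{\partial \mathrm{VEC}(A)} = \frac{\partial (x^\top \otimes I_m) \mathrm{VEC}(A)}{\partial \mathrm{VEC}(A)} = (x^\top \otimes I_m)^\top = x \otimes I_m, \tag{1.43}$$


where we use (1.37) and (1.29) for the second and the third equalities, respectively.
Q.E.D. □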
Lemma 1.7 ([5]) Let $x$, $a$ and $B$ denote vectors and a matrix with appropriate
sizes, respectively. Then, we have


$$\frac{\partial x^\top a}{\partial x} = \frac{\partial a^\top x}{\partial x} = a, \tag{1.44}$$


$$\frac{\partial x^\top B x}{\partial x} = (B + B^\top) x. \tag{1.45}$$


For a given scalar function $\ell : x \in \mathbb{R}^n \mapsto \mathbb{R}$, the derivative is often called the
gradient, which can be represented by the denominator layout:


$$\nabla \ell := \frac{\partial \ell}{\partial x} \in \mathbb{R}^n.$$
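
The closed-form gradients of a linear function $a^\top x$ and a quadratic form $x^\top B x$ can be compared with finite differences. The sketch below assumes NumPy; `numerical_gradient` is a hypothetical helper using central differences, and the step size `eps` is an arbitrary choice. The result is returned as a vector with the same shape as $x$, in line with the denominator layout.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference gradient, returned as a vector shaped like x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
n = 4
a, x = rng.standard_normal(n), rng.standard_normal(n)
B = rng.standard_normal((n, n))                       # deliberately not symmetric

grad_linear = numerical_gradient(lambda z: a @ z, x)          # expect a
grad_quadratic = numerical_gradient(lambda z: z @ B @ z, x)   # expect (B + B^T) x
```

Using a non-symmetric `B` matters here: for symmetric `B` the gradient reduces to $2Bx$, which would hide the role of the $B^\top$ term.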
1.6.1 Some Definitions


Let $X$, $Y$ and $Z$ be non-empty sets. The identity operator on $H$ is denoted by $I$, i.e.
$Ix = x, \forall x \in H$. Let $D \subset H$ be a non-empty set. The set of the fixed points of an
operator $T : D \mapsto D$ is denoted by

$$\mathrm{Fix}\, T = \{x \in D \mid Tx = x\}.$$


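Let $X$ and $Y$ be real normed vector spaces. As a special case of an operator, we define
a set of linear operators:


$$\mathcal{B}(X, Y) = \{T : X \mapsto Y \mid T \text{ is linear and continuous}\}$$


and we write $\mathcal{B}(X) = \mathcal{B}(X, X)$. Let $f : X \mapsto [-\infty, \infty]$ be a function. The domain
of $f$ is


$$\mathrm{dom} f = \{x \in X \mid f(x) < \infty\},$$


the graph of $f$ is


$$\mathrm{gra} f = \{(x, y) \in X \times \mathbb{R} \mid f(x) = y\},$$


and the epigraph of $f$ is


$$\mathrm{epi} f = \{(x, y) : x \in X, \ y \in \mathbb{R}, \ y \ge f(x)\}.$$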
The indicator function $\iota_C : X \mapsto [-\infty, \infty]$ of $C \subset X$ is defined as


$$\iota_C(x) = \begin{cases} 0, & \text{if } x \in C \\ \infty, & \text{otherwise.} \end{cases} \tag{1.46}$$


We often use another definition of the indicator function:


$$\chi_C(x) = \begin{cases} 1, & \text{if } x \in C \\ 0, & \text{otherwise.} \end{cases} \tag{1.47}$$


The support function of a set $C$ is defined as


$$S_C(x) = \sup\{\langle x, y \rangle \mid y \in C\}.$$


An affine function is denoted by


$$x \mapsto Tx + b, \quad x \in X, \ b \in Y, \ T \in \mathcal{B}(X, Y).$$


A function $f$ is called lower semicontinuous at $x_0$ if for every $\epsilon > 0$ there exists
a neighbourhood $U$ of $x_0$ such that $f(x) > f(x_0) - \epsilon$ for all $x \in U$. This is
expressed as


$$\liminf_{x \to x_0} f(x) \ge f(x_0).$$


A function is lower semicontinuous if and only if all of its lower level sets $\{x \in
X : f(x) \le \alpha\}$ are closed. Alternatively, $f$ is lower semicontinuous if and only if
the epigraph of $f$ is closed. A function is proper if $-\infty \notin f(X)$ and $\mathrm{dom} f \ne \emptyset$
(Fig. 1.2).

An operator $A : H \mapsto H$ is positive semidefinite if and only if


$$\langle x, Ax \rangle \ge 0, \quad \forall x \in H.$$


An operator $A : H \mapsto H$ is positive definite if and only if


$$\langle x, Ax \rangle > 0, \quad \forall x \in H, \ x \ne 0.$$


Fig. 1.2 Epigraphs for (a) a lower semicontinuous function, and (b) a function which is not lower
semicontinuous


For simplicity, we denote $A \succeq 0$ (resp. $A \succ 0$) for positive semidefinite (resp.
positive definite) operators. If $A : \mathbb{C}^n \mapsto \mathbb{C}^n$, then $S^n_{++}$ and $S^n_{+}$ denote the sets of
$n \times n$ positive definite and positive semidefinite matrices, respectively. Here, the
eigenvalues of positive semidefinite (resp. positive definite) matrices are all real and
non-negative (resp. positive).


1.6.2. Convex Sets, Convex Functions 


A function $f(x)$ is a convex function if $\mathrm{dom} f$ is a convex set and

$$f(\theta x_1 + (1-\theta) x_2) \le \theta f(x_1) + (1-\theta) f(x_2)$$

for all $x_1, x_2 \in \mathrm{dom} f$, $0 \le \theta \le 1$. A convex set is a set that contains every line
segment between any two points in the set (see Fig. 1.3). Specifically, a set $C$ is
convex if, whenever $x_1, x_2 \in C$, we have $\theta x_1 + (1-\theta) x_2 \in C$ for all $0 \le \theta \le 1$. The relation
between a convex function and a convex set can also be stated using its epigraph.
Specifically, a function $f(x)$ is convex if and only if its epigraph $\mathrm{epi} f$ is a convex
set.

Convexity is preserved under various operations. For example, if $\{f_i\}_{i \in I}$ is
a family of convex functions, then $\sup_{i \in I} f_i$ is convex. In addition, a set of
convex functions is closed under addition and multiplication by strictly positive real
numbers. Moreover, the pointwise limit of a convergent sequence of convex functions is
also convex. Important examples of convex functions are summarized in Table 1.1.
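The defining inequality can be probed numerically. The sketch below is only a sanity check on random points, not a proof: it tests the convexity inequality for the log-sum-exponential function of Table 1.1 and for a pointwise supremum of two convex functions. The helper names (`log_sum_exp`, `violates_convexity`) and the sampling ranges are illustrative assumptions.

```python
import math
import random

def log_sum_exp(x):
    """Log-sum-exponential from Table 1.1, computed in a numerically stable way."""
    m = max(x)
    return m + math.log(sum(math.exp(xi - m) for xi in x))

def violates_convexity(f, x1, x2, theta, tol=1e-9):
    """True if f(theta*x1 + (1-theta)*x2) > theta*f(x1) + (1-theta)*f(x2)."""
    mix = [theta * a + (1 - theta) * b for a, b in zip(x1, x2)]
    return f(mix) > theta * f(x1) + (1 - theta) * f(x2) + tol

rng = random.Random(0)
trials = [([rng.uniform(-3, 3) for _ in range(4)],
           [rng.uniform(-3, 3) for _ in range(4)],
           rng.random()) for _ in range(1000)]

# Pointwise supremum of two convex functions (log-sum-exp and the 1-norm).
sup_f = lambda x: max(log_sum_exp(x), sum(abs(v) for v in x))
# The negated log-sum-exp is concave, so the test should fail somewhere.
neg_f = lambda x: -log_sum_exp(x)

lse_ok = not any(violates_convexity(log_sum_exp, *t) for t in trials)
sup_ok = not any(violates_convexity(sup_f, *t) for t in trials)
neg_fails = any(violates_convexity(neg_f, *t) for t in trials)
print(lse_ok, sup_ok, neg_fails)
```

Passing such a test on sampled points is only a necessary condition for convexity, but a single violation is enough to rule it out.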


Fig. 1.3 A convex set and a convex function (panels: convex set, non-convex set, convex function)


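Table 1.1 Examples of convex functions

| Name | $f(x)$ |
|---|---|
| Exponential | $e^{ax}$, $\forall a \in \mathbb{R}$ |
| Quadratic over linear | $x^2/y$, $(x,y) \in \mathbb{R} \times \mathbb{R}_{++}$ |
| Huber function | $\lvert x\rvert^2/(2\mu)$ if $\lvert x\rvert \le \mu$; $\lvert x\rvert - \mu/2$ if $\lvert x\rvert > \mu$ |
| Relative entropy | $y \log y - y \log x$, $(x,y) \in \mathbb{R}_{++} \times \mathbb{R}_{++}$ |
| Indicator function | $\iota_C(x)$, $C$: convex set |
| Support function | $S_C(x) = \sup\{\langle x, y\rangle \mid y \in C\}$ |
| Distance to a set | $d(x,S) = \inf_{y \in S} \lVert x - y\rVert$ |
| Affine function | $Tx + b$, $x \in \mathbb{R}^n$ |
| Quadratic function | $x^\top Q x/2$, $x \in \mathbb{R}^n$, $Q \in \mathbb{S}^n_{+}$ |
| $p$-norms | $\lVert x\rVert_p = \left(\sum_i \lvert x_i\rvert^p\right)^{1/p}$, $p \ge 1$ |
| $l_\infty$-norm | $\lVert x\rVert_\infty = \max_i \lvert x_i\rvert$ |
| Max function | $\max\{x_1, \cdots, x_n\}$ |
| Log-sum-exponential | $\log\left(\sum_{i=1}^n e^{x_i}\right)$, $x = (x_1, \cdots, x_n) \in \mathbb{R}^n$ |
| Gaussian data fidelity | $\lVert y - Ax\rVert^2$, $x \in \mathcal{H}$ |
| Poisson data fidelity | $\langle \mathbf{1}, Ax\rangle - \langle y, \log(Ax)\rangle$, $x \in \mathbb{R}^n$, $\mathbf{1} = (1, \cdots, 1) \in \mathbb{R}^m$ |
| Spectral norm | $\lVert X\rVert_2 = \sigma_{\max}(X) = \left(\lambda_{\max}(X^\top X)\right)^{1/2}$, $X \in \mathbb{R}^{m \times n}$ |
| Nuclear norm | $\lVert X\rVert_* = \sum_i \sigma_i(X) = \sum_i \left(\lambda_i(X^\top X)\right)^{1/2}$, $X \in \mathbb{R}^{m \times n}$ |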


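Table 1.2 Examples of concave functions

| Name | $f(x)$ |
|---|---|
| Powers | $x^p$, $0 \le p \le 1$, $x \in \mathbb{R}_{++}$ |
| Geometric mean | $\left(\prod_{i=1}^n x_i\right)^{1/n}$ |
| Logarithm | $\log x$, $x \in \mathbb{R}_{++}$ |
| Log determinant | $\log\det(X)$, $X \in \mathbb{S}^n_{++}$ |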

A function f is concave if — f is convex. It is easy to show that an affine function 
f(x) = Ax + b is both convex and concave. Examples of concave functions that 
are often used in this textbook can be found in Table 1.2. 


1.6.3 Subdifferentials 


The directional derivative of $f$ at $x \in \mathrm{dom} f$ in the direction of $y \in \mathcal{H}$ is defined by

$$f'(x; y) = \lim_{\alpha \downarrow 0} \frac{f(x + \alpha y) - f(x)}{\alpha} \tag{1.48}$$

if the limit exists. If the limit exists for all $y \in \mathcal{H}$, then one says that $f$ is Gâteaux
differentiable at $x$. Suppose $f'(x; \cdot)$ is linear and continuous on $\mathcal{H}$. Then, there exists
a unique gradient vector $\nabla f(x) \in \mathcal{H}$ such that

$$f'(x; y) = \langle y, \nabla f(x)\rangle, \quad \forall y \in \mathcal{H}.$$


If a function is differentiable, the convexity of a function can easily be checked
using the first- and second-order differentiability, as stated in the following:

Proposition 1.1 Let $f : \mathcal{H} \mapsto (-\infty, \infty]$ be proper. Suppose that $\mathrm{dom} f$ is open
and convex, and $f$ is Gâteaux differentiable on $\mathrm{dom} f$. Then, the following are
equivalent:

1. $f$ is convex.
2. (First-order): $f(y) \ge f(x) + \langle y - x, \nabla f(x)\rangle, \quad \forall x, y \in \mathcal{H}$.
3. (Monotonicity of gradient): $\langle y - x, \nabla f(y) - \nabla f(x)\rangle \ge 0, \quad \forall x, y \in \mathcal{H}$.


If the convergence in (1.48) is uniform with respect to $y$ on bounded sets, i.e.

$$\lim_{0 \neq y \to 0} \frac{f(x + y) - f(x) - \langle y, \nabla f(x)\rangle}{\lVert y\rVert} = 0, \tag{1.49}$$

then $f$ is Fréchet differentiable and $\nabla f(x)$ is called the Fréchet gradient of $f$ at $x$.
If $f$ is differentiable and convex, then it is clear that

$$x \in \arg\min f \iff \nabla f(x) = 0.$$


However, if $f$ is not differentiable, we need a more general framework to characterize
the minimizers. The subdifferential of $f$ is a set-valued operator defined as

$$\partial f(x) = \{u \in \mathcal{H} : f(y) \ge f(x) + \langle y - x, u\rangle, \ \forall y \in \mathcal{H}\}. \tag{1.50}$$

The elements of the subdifferential $\partial f(x)$ are called subgradients of $f$ at $x$. Another
important role of the subdifferentials comes from Fermat's rule that characterizes
the global minimizers (Fig. 1.4):

Theorem 1.3 (Fermat's Rule) Let $f : \mathcal{H} \mapsto (-\infty, \infty]$ be proper. Then,

$$\arg\min f = \mathrm{zer}\,\partial f := \{x \in \mathcal{H} \mid 0 \in \partial f(x)\}. \tag{1.51}$$
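To make the definition (1.50) and Fermat's rule (1.51) concrete, the following sketch tests candidate subgradients of $f(x) = |x|$ on a finite grid of points $y$. Checking a grid only verifies a necessary condition, and the helper `is_subgradient` and the grid are illustrative assumptions rather than anything from the text.

```python
def is_subgradient(f, x, u, test_points, tol=1e-12):
    """Check f(y) >= f(x) + (y - x) * u on a finite set of test points y."""
    return all(f(y) >= f(x) + (y - x) * u - tol for y in test_points)

f = abs
ys = [k / 100.0 for k in range(-300, 301)]  # grid on [-3, 3]

# At x = 0 every u in [-1, 1] passes the test, while values outside fail.
inside = all(is_subgradient(f, 0.0, u, ys) for u in (-1.0, -0.3, 0.0, 0.7, 1.0))
outside = any(is_subgradient(f, 0.0, u, ys) for u in (-1.5, 1.2))

# Fermat's rule: 0 is a subgradient at the minimizer x = 0, but not at x = 1.
zero_at_min = is_subgradient(f, 0.0, 0.0, ys)
zero_at_one = is_subgradient(f, 1.0, 0.0, ys)
print(inside, outside, zero_at_min, zero_at_one)
```

This agrees with the picture in Fig. 1.4, where $\partial f(0) = [-1, 1]$ contains zero exactly at the global minimizer.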


1.6.4 Convex Conjugate 


A convex conjugate or convex dual is a very important concept for both classical and
modern convex optimization techniques. Formally, the conjugate function
$f^* : \mathcal{H} \mapsto [-\infty, \infty]$ of a function $f : \mathcal{H} \mapsto [-\infty, \infty]$ is defined as

$$f^*(u) = \sup_{x \in \mathcal{H}} \{\langle u, x\rangle - f(x)\}. \tag{1.52}$$

The transform in (1.52) is often called the Legendre-Fenchel transform.

Fig. 1.4 Fermat's rule for the global minimizer. Panels: a differentiable $f(x)$ with $\partial f(x) = \{\nabla f(x)\}$, and $f(x) = |x|$ with $0 \in \partial f(0) = [-1, 1]$

Fig. 1.5 (a) Geometry of convex conjugate. (b) Examples of finding convex conjugate for $f(x) = bx + c$

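Figure 1.5a shows a geometric interpretation of the convex conjugate when $\mathcal{H} = \mathbb{R}$.
For example, when $f(x) = x^2 - x$, the convex conjugate $f^*(u)$ at $u = 1$ is
the maximum difference between $g(x) = x$ and $f(x) = x^2 - x$, which occurs
at $x = 1$ in this example. The difference is also equal to the magnitude of the
$y$-intercept of the supporting hyperplane of $f(x)$ at $x = 1$. Figure 1.5b shows another
intuitive example. Here, $f(x) = bx + c$. In this case, the difference between the
line $g_1(x) = u_1 x$ and $f(x)$ becomes infinite as $x \to -\infty$. Similarly, the difference
between the line $g_2(x) = u_2 x$ and $f(x)$ becomes infinite as $x \to \infty$. Only when
$u = b$ does the maximum difference become finite, and it is then equal to $-c$. Table 1.3
summarizes these findings for a variety of functions that are often used in
applications.

Table 1.3 Examples of convex conjugate pairs used often in imaging problems. Here, $D \subset \mathcal{H}$ and
we use the interpretation $0 \log 0 = 0$

| $f(x)$ | $f^*(u)$ | $\mathrm{dom} f^*$ |
|---|---|---|
| $f(ax)$ | $f^*(u/a)$ | $D$ |
| $f(x + b)$ | $f^*(u) - \langle b, u\rangle$ | $D$ |
| $a f(x)$, $a > 0$ | $a f^*(u/a)$ | $D$ |
| $bx + c$ | $-c$ if $u = b$; $+\infty$ if $u \neq b$ | $\{b\}$ |
| $1/x$ | $-2\sqrt{-u}$ | $-\mathbb{R}_{+}$ |
| $-\log x$ | $-(1 + \log(-u))$ | $-\mathbb{R}_{++}$ |
| $x \log x$ | $e^{u-1}$ | $\mathbb{R}$ |
| $\sqrt{1 + x^2}$ | $-\sqrt{1 - u^2}$ | $[-1, 1]$ |
| $e^x$ | $u \log(u) - u$ | $\mathbb{R}_{+}$ |
| $\log(1 + e^x)$ | $u \log(u) + (1 - u)\log(1 - u)$ | $[0, 1]$ |
| $-\log(1 - e^x)$ | $u \log(u) - (1 + u)\log(1 + u)$ | $\mathbb{R}_{+}$ |
| $\lvert x\rvert^p / p$, $p > 1$ | $\lvert u\rvert^q / q$, $1/p + 1/q = 1$ | $\mathbb{R}$ |
| $\lVert x\rVert_2$ | $0$ if $\lVert u\rVert_2 \le 1$; $\infty$ if $\lVert u\rVert_2 > 1$ | $\{u \in \mathbb{R}^n : \lVert u\rVert_2 \le 1\}$ |
| $\langle a, x\rangle + b$ | $-b$ if $u = a$; $\infty$ if $u \neq a$ | $\{a\} \subset \mathbb{R}^n$ |
| $\frac{1}{2} x^\top Q x$, $Q \in \mathbb{S}^n_{++}$ | $\frac{1}{2} u^\top Q^{-1} u$ | $\mathbb{R}^n$ |
| $\iota_C(x)$ | $S_C(u)$ | $\mathcal{H}$ |
| $\log\left(\sum_{i=1}^n e^{x_i}\right)$ | $\sum_{i=1}^n u_i \log u_i$, $\sum_{i=1}^n u_i = 1$ | $\mathbb{R}^n_{+}$ |
| $\log\det X^{-1}$ | $\log\det(-U)^{-1} - n$ | $-\mathbb{S}^n_{++}$ |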
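The supremum in (1.52) can be approximated by brute force on a grid, which is a convenient way to check entries of a conjugate table. The sketch below does this for $f(x) = x^2 - x$, the example discussed for Fig. 1.5a. The closed form $f^*(u) = (u+1)^2/4$ used for comparison follows from maximizing $(u+1)x - x^2$ over $x$; the function `conjugate_on_grid` and the grid range are illustrative assumptions.

```python
def conjugate_on_grid(f, u, xs):
    """Approximate f*(u) = sup_x { u*x - f(x) } by a maximum over the grid xs."""
    return max(u * x - f(x) for x in xs)

xs = [k / 1000.0 for k in range(-5000, 5001)]  # grid on [-5, 5] with step 1e-3

f = lambda x: x * x - x
f_star_at_1 = conjugate_on_grid(f, 1.0, xs)    # analytic value is 1, attained at x = 1

# Closed form for this particular f: f*(u) = (u + 1)^2 / 4.
closed_form = lambda u: (u + 1.0) ** 2 / 4.0
max_err = max(abs(conjugate_on_grid(f, u, xs) - closed_form(u))
              for u in (-2.0, 0.0, 1.0, 3.0))

# Fenchel-Young inequality (1.54): f(x) + f*(u) >= u*x for all x and u.
fy_holds = all(f(x) + closed_form(u) >= u * x - 1e-12
               for x in (-2.0, 0.5, 1.0, 4.0) for u in (-2.0, 0.0, 1.0, 3.0))
print(f_star_at_1, max_err, fy_holds)
```

A grid search like this only works when the maximizer lies inside the grid, so the range has to be chosen with the function in mind.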

It is clear that $f^*$ is convex, since $f^*$ is a pointwise supremum of functions that are affine,
and hence convex, in $u$. In general, if $f : \mathcal{H} \mapsto [-\infty, \infty]$, then the following hold:

1. For $\alpha \in \mathbb{R}_{++}$, we have

$$(\alpha f)^* = \alpha f^*(\cdot/\alpha). \tag{1.53}$$

2. Fenchel-Young inequality:

$$f(x) + f^*(y) \ge \langle y, x\rangle, \quad \forall x, y \in \mathcal{H}. \tag{1.54}$$

3. Let $f, g$ be proper functions from $\mathcal{H}$ to $(-\infty, \infty]$. Then,

$$f(x) + g(x) \ge -f^*(u) - g^*(-u), \quad \forall x, u \in \mathcal{H}. \tag{1.55}$$

If $f$ is convex, proper, and lower semicontinuous, then the following properties
hold:

$$f^{**} = f, \tag{1.56}$$

$$y \in \partial f(x) \iff f(x) + f^*(y) = \langle x, y\rangle \iff x \in \partial f^*(y). \tag{1.57}$$
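As a quick check of (1.56) and (1.57), the following worked example uses $f(x) = \frac{1}{2}x^2$ on $\mathcal{H} = \mathbb{R}$. The choice of $f$ is an illustrative assumption and is not an example taken from the text.

```latex
\begin{align*}
f^*(u) &= \sup_{x\in\mathbb{R}}\Big\{ux-\tfrac12x^2\Big\} = \tfrac12u^2
  && \text{(the supremum is attained at } x=u\text{)},\\
f^{**}(x) &= \sup_{u\in\mathbb{R}}\Big\{xu-\tfrac12u^2\Big\} = \tfrac12x^2 = f(x)
  && \text{(this is (1.56))},\\
f(x)+f^*(y)-xy &= \tfrac12(x-y)^2 \ge 0
  && \text{(Fenchel--Young, with equality iff } y=x\text{)}.
\end{align*}
```

Since $\partial f(x) = \{x\}$ and $\partial f^*(y) = \{y\}$ here, equality in the Fenchel-Young inequality holds exactly when $y \in \partial f(x)$, or equivalently $x \in \partial f^*(y)$, which is what (1.57) states.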


1.6.5 Lagrangian Dual Formulation 


Perhaps one of the most important uses of the convex conjugate is to obtain the dual
formulation. More specifically, for a given primal problem (P),

$$(P): \quad \min_{x \in \mathcal{H}} f(x) + g(x), \tag{1.58}$$

we can obtain the associated dual problem using (1.55):

$$(D): \quad -\min_{u \in \mathcal{H}} f^*(u) + g^*(-u). \tag{1.59}$$

The gap between the primal and dual problem is called the duality gap.

Example: Dual for Composite Function

For the given primal problem:

$$(P): \quad \min_{x \in \mathbb{R}^n} f(x) + g(Ax), \tag{1.60}$$

with $A \in \mathbb{R}^{m \times n}$, the dual problem is given by

$$(D): \quad -\min_{u \in \mathbb{R}^m} f^*(A^\top u) + g^*(-u).$$

Proof Note that (P) is equivalent to the following constrained minimization
problem:

$$\min_{x, y} f(x) + g(y) \quad \text{subject to} \quad Ax = y,$$
where the last equality comes from $x = A^\top u$ at the minimizer. Hence, the
dual problem becomes

$$(D): \quad \min_{u \in \mathbb{R}^m} \frac{1}{2} u^\top A A^\top u + u^\top b.$$

Why is this dual formulation useful? Suppose that $A$ is highly ill-posed,
say that $n = 1000$ and $m = 1$. Then, the dual problem (D) is a one-dimensional
problem which is computationally much less expensive than the
primal problem (P) of the dimension $n = 1000$. After the dual solution $\hat{u}$ is
obtained, the primal solution is just $\hat{x} = A^\top \hat{u}$.


We formally define a Lagrangian dual problem.

Definition 1.16 ([6]) Suppose that a primal problem is given by

$$\min_x f_0(x)$$
$$\text{subject to} \quad f_i(x) \le 0, \ i = 1, \cdots, n, \tag{1.61}$$
$$h_i(x) = 0, \ i = 1, \cdots, p. \tag{1.62}$$

Then, the associated Lagrangian dual problem is defined by

$$\max_{\alpha, \nu} g(\alpha, \nu) \tag{1.63}$$
$$\text{subject to} \quad \alpha \ge 0, \tag{1.64}$$

where $\alpha = [\alpha_1, \cdots, \alpha_n]$ and $\nu = [\nu_1, \cdots, \nu_p]$ are referred to as the dual variables
or Lagrangian multipliers, $\alpha \ge 0$ implies that each element is non-negative, and the
Lagrangian $g(\alpha, \nu)$ is defined by

$$g(\alpha, \nu) := \inf_x \left\{ f_0(x) + \sum_{i=1}^n \alpha_i f_i(x) + \sum_{j=1}^p \nu_j h_j(x) \right\}. \tag{1.65}$$

One of the important findings in convex optimization theory [6] is that if the
primal problem is convex (and a constraint qualification such as Slater's condition
holds), then we have the following strong duality:

$$g(\alpha^*, \nu^*) = f_0(x^*), \tag{1.66}$$

where $x^*$ and $\alpha^*, \nu^*$ are the optimal solutions for the primal and dual problems,
respectively. Often, the dual formulation is easier to solve than the primal problem.
Additionally, there is also an interesting geometric interpretation, which will be
explained later.
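A one-dimensional worked example may help to see how (1.65) and (1.66) fit together. The problem below, minimizing $x^2$ subject to $1 - x \le 0$, is an illustrative assumption chosen so that every step can be done by hand.

```latex
\begin{align*}
&\text{Primal:}\quad \min_x\ x^2 \quad\text{subject to}\quad 1-x\le0,
  \qquad x^*=1,\quad f_0(x^*)=1.\\
&g(\alpha)=\inf_x\big\{x^2+\alpha(1-x)\big\}=\alpha-\tfrac14\alpha^2
  \qquad\text{(the infimum is attained at } x=\alpha/2\text{)}.\\
&\text{Dual:}\quad \max_{\alpha\ge0}\ \alpha-\tfrac14\alpha^2,
  \qquad \alpha^*=2,\quad g(\alpha^*)=1=f_0(x^*).
\end{align*}
```

The primal and dual optimal values coincide, so the duality gap is zero in this convex example, and the constraint is active at the optimum, where the multiplier $\alpha^* = 2$ is strictly positive.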
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1.7. Exercises 


1. Show that an $l_p$ norm with $0 < p < 1$ is not a norm.
2. Prove the equalities in (1.17).
3. Prove the matrix inversion lemma, Eq. (1.23).
4. Let $x \in \mathbb{R}^n$, $y \in \mathbb{R}^m$ and $A \in \mathbb{R}^{m \times n}$. Then, show the following:

$$\hat{x} = \arg\min_{x \in \mathbb{R}^n} \lVert y - Ax\rVert^2 + \lambda \lVert x\rVert^2 = (A^\top A + \lambda I)^{-1} A^\top y = A^\top (A A^\top + \lambda I)^{-1} y,$$

   where $A^\top$ denotes the transpose of $A$, and $I$ is an appropriate size identity
   matrix. (Hint: for the last equality, you need to use the matrix inversion lemma.)
5. Prove Lemma 1.2.
6. Prove (1.31).
7. Prove Lemma 1.4.
8. Prove Lemma 1.7.
9. Show that if $\mathcal{L}$ is an affine mapping and $f$ is convex, then $f \circ \mathcal{L}$ is also convex,
   where $\circ$ refers to the composite function.
10. Find at least three examples of functions that are not semicontinuous.
11. In Table 1.1, show that the relative entropy, indicator function, support function,
    $p$-norm (with $p \ge 1$) and max functions are convex.
12. Let $f : \mathcal{H} \mapsto (-\infty, \infty]$ be proper. Suppose that $\mathrm{dom} f$ is open and convex,
    and $f$ is Gâteaux differentiable on $\mathrm{dom} f$. Then, show that the following are
    equivalent:
    (a) $f$ is convex.
    (b) $f(y) \ge f(x) + \langle y - x, \nabla f(x)\rangle, \quad \forall x, y \in \mathcal{H}$.
    (c) $\langle y - x, \nabla f(y) - \nabla f(x)\rangle \ge 0, \quad \forall x, y \in \mathcal{H}$.
    (d) Moreover, if $f$ is twice Gâteaux differentiable on $\mathrm{dom} f$,
        $\nabla^2 f(x) \succeq 0, \quad \forall x \in \mathrm{dom} f$.
13. Let $f(x) = |x|$ with $x \in [-1, 1]$. Find its subdifferential $\partial f(x)$.
14. Prove Fermat's rule in Theorem 1.3.
15. Show that the following properties hold for the subdifferentials:
    (a) If $f$ is differentiable, then $\partial f(x) = \{\nabla f(x)\}$.
    (b) Let $f$ be proper. Then, $\partial f(x)$ is closed and convex for any $x \in \mathrm{dom} f$.
    (c) Let $\lambda \in \mathbb{R}_{++}$. Then, $\partial(\lambda f) = \lambda \partial f$.
    (d) Let $f, g$ be convex and lower semicontinuous functions, and let $\mathcal{L}$ be a linear
        operator. Then

$$\partial(f + g \circ \mathcal{L}) = \partial f + \mathcal{L}^* \circ (\partial g) \circ \mathcal{L}. \tag{1.67}$$

16. Prove Eq. (1.53).
17. Let $f(x) = \frac{1}{2}(x_1^2 + x_2^2) - x_1 - x_2$. Derive the convex conjugate $f^*(x)$.
18. Let $f$ be a proper function from $\mathcal{H}$ to $(-\infty, \infty]$. Show that

$$f(x) + f^*(y) \ge \langle y, x\rangle, \quad \forall x, y \in \mathcal{H}.$$

19. If $f$ is convex and lower semicontinuous, then show that

$$(\partial f)^{-1} = \partial f^*.$$

20. We often have the following form of the primal problem:

$$(P): \quad \min_x f(x) + g(Ax), \tag{1.68}$$

    where

$$g(Ax) = \lVert Ax\rVert, \quad f(x) = \lVert y - x\rVert_2^2,$$

    with the operator $A : \mathbb{R}^n \mapsto \mathbb{R}^m$. Show that the associated dual problem is
    given by

$$-\min_{u \in \mathbb{R}^m} u^\top A A^\top u + y^\top A^\top u \quad \text{subject to} \quad \lVert u\rVert_2 \le 1.$$

Chapter 2
Linear and Kernel Classifiers


2.1 Introduction 


Classification is one of the most basic tasks in machine learning. In computer vision,
an image classifier is designed to classify input images into their corresponding categories.
Although this task appears trivial to humans, there are considerable challenges with
regard to automated classification by computer algorithms.

For example, let us think about recognizing “dog” images. One of the first 
technical issues here is that a dog image is usually taken in the form of a digital 
format such as JPEG, PNG, etc. Aside from the compression scheme used in 
the digital format, the image is basically just a collection of numbers on a two- 
dimensional grid, which takes integer values from 0 to 255. Therefore, a computer 
algorithm should read the numbers to decide whether such a collection of numbers 
corresponds to a high-level concept of “dog”. However, if the viewpoint is changed, 
the composition of the numbers in the array is totally changed, which poses 
additional challenges to the computer program. To make matters worse, in a natural 
setting a dog is rarely found on a white background; rather, the dog plays on the 
lawn or takes a nap in the living room, hides underneath furniture or chews with her 
eyes closed, which makes the distribution of the numbers very different depending 
on the situation. Additional technical challenges in computer-based recognition of 
a dog come from all kinds of sources such as different illumination conditions, 
different poses, occlusion, intra-class variation, etc., as shown in Fig. 2.1. Therefore, 
designing a classifier that is robust to such variations has been one of the important topics
in the computer vision literature for several decades.

In fact, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [7] 
was initiated to evaluate various computer algorithms for image classification at 
large scale. ImageNet is a large visual database designed for use in visual object 
recognition software research [8]. Over 14 million images have been hand-annotated 
in the project to indicate which objects are depicted, and at least one million of 
the images also have bounding boxes. In particular, ImageNet contains more than
20,000 categories, each made up of several hundred images. Since 2010, the ImageNet

Fig. 2.1 Technical challenges in recognizing a dog from digital images. Figures courtesy of Ella
Jiwoo Ye
project has organized an annual software competition, the ImageNet Large Scale 
Visual Recognition Challenge (ILSVRC), in which software programs compete for 
the correct classification and recognition of objects and scenes. The main motivation 
is to allow researchers to compare progress in classification across a wider variety 
of objects. Since the introduction of AlexNet in 2012 [9], which was the first
deep learning approach to win the ImageNet Challenge, the state-of-the-art image
classification methods are all deep learning approaches, and now their performance
even surpasses that of human observers.

Before we discuss in detail recent deep learning approaches, we revisit the 
classical classifier, in particular the support vector machine (SVM) [10], to discuss 
its mathematical principles. Although the SVM is already an old classical technique, 
its review is important since the mathematical understanding of the SVM allows 
readers to understand how the modern deep learning approaches are closely related 
to the classical ones. 

Specifically, consider binary classification problems where data sets from two 
different classes are distributed as shown in Fig. 2.2a,b,c. Note that in Fig. 2.2a, the 
two sets are perfectly separable with linear hyperplanes. For the case of Fig. 2.2b, 
there exists no linear hyperplane that perfectly separates two data sets, but one could 
find a linear boundary where only a small set of data are incorrectly classified. 
However, the situation in Fig. 2.2c is much different, since there exists no linear
boundary that can separate the majority of elements of the two classes. Rather, one 
could find a nonlinear class boundary that can separate the two sets with small errors. 
The theory of the SVM deals with all situations in Fig. 2.2a,b,c using a hard-margin 
linear classifier, soft-margin linear classifier, and kernel SVM method, respectively. 
In the following, we discuss each topic in detail. 
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Fig. 2.2. Examples of binary classification problems: (a) linear separable case, (b) approximately 
linear separable case, and (c) linear non-separable case 
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2.2.1 Maximum Margin Classifier for Separable Cases 


For the linear separable case in Fig. 2.2a, there can be an infinite number of choices 
of linear hyperplanes. Among them, one of the most widely used choices of the 
classification boundary is to maximize the margin between the two classes. This is 
often called the maximum margin linear classifier [10]. 

To derive this, we introduce some notation. Let $\{x_i, y_i\}_{i=1}^n$ denote the set of the
data $x_i \in \mathcal{X} \subset \mathbb{R}^d$ with the binary label $y_i$ such that $y_i \in \{1, -1\}$. We now define a
hyperplane in $\mathbb{R}^d$:

$$\langle w, x\rangle + b = w^\top x + b = 0, \tag{2.1}$$

where $^\top$ denotes the transpose, $\langle\cdot,\cdot\rangle$ is the inner product, and $b \in \mathbb{R}$ is a bias term. See
Fig. 2.3 for more details. If the two classes are separable, then there exist sets $S_1$ and
$S_{-1}$ such that the data set with $y_i = 1$ and $y_i = -1$ belongs to the sets $S_1$ and $S_{-1}$,
respectively:

$$S_1 = \{x \in \mathbb{R}^d \mid \langle w, x\rangle + b \ge 1\}, \tag{2.2}$$
$$S_{-1} = \{x \in \mathbb{R}^d \mid \langle w, x\rangle + b \le -1\}. \tag{2.3}$$

Fig. 2.3 Geometric structure of hard-margin linear support vector machine classifier. The figure marks the sets $S_1$ and $S_{-1}$, the boundaries $\langle w, x\rangle + b = 1$ and $\langle w, x\rangle + b = -1$, the margin, and the support vectors


Then, the margin between the two sets is defined as the minimum distance between
the two linear boundaries of $S_1$ and $S_{-1}$. To calculate this, we need the following
lemma:

Lemma 2.1 The distance between two parallel hyperplanes $\ell_1 : \langle w, x\rangle + c_1 = 0$
and $\ell_2 : \langle w, x\rangle + c_2 = 0$ is given by

$$m := \frac{|c_1 - c_2|}{\lVert w\rVert}. \tag{2.4}$$

Proof Let $m$ be the distance between the two parallel hyperplanes $\ell_1$ and $\ell_2$. Then
there exist two points $x_1 \in \ell_1$ and $x_2 \in \ell_2$ such that $\lVert x_1 - x_2\rVert = m$. Then, using the
Pythagorean theorem, the vector $v := x_1 - x_2$ should be along the normal direction
of the hyperplanes. Accordingly,

$$m = \lVert x_1 - x_2\rVert = \left| \langle w/\lVert w\rVert, x_1\rangle - \langle w/\lVert w\rVert, x_2\rangle \right|,$$

since $w/\lVert w\rVert$ is the unit normal vector of the hyperplanes. Therefore, we have

$$m = \frac{|\langle w, x_1\rangle - \langle w, x_2\rangle|}{\lVert w\rVert} = \frac{|c_1 - c_2|}{\lVert w\rVert}.$$

Q.E.D.


Since $\langle w, x\rangle + b - 1 = 0$ and $\langle w, x\rangle + b + 1 = 0$ correspond to the linear
boundaries of $S_1$ and $S_{-1}$, Lemma 2.1 informs us that the margin between the two
classes is given by

$$\text{margin} := \frac{2}{\lVert w\rVert}. \tag{2.5}$$

Therefore, for the given training data set $\{x_i, y_i\}_{i=1}^n$ with $x_i \in \mathcal{X} \subset \mathbb{R}^d$ and the
binary label $y_i \in \{1, -1\}$, the maximum margin linear binary classifier design
problem can be formulated as follows:

$$(P) \quad \min_w \frac{1}{2} \lVert w\rVert^2 \tag{2.6}$$
$$\text{subject to} \quad 1 - y_i(\langle w, x_i\rangle + b) \le 0, \quad \forall i. \tag{2.7}$$

Note that the minimization of $\lVert w\rVert^2/2$ in (2.6) is equivalent to the maximization
of the margin $2/\lVert w\rVert$, and by noting that $y_i = 1$ and $-1$ for the sets $S_1$ and $S_{-1}$,
respectively, we can see that (2.7) corresponds to the desirable constraints. Another
thing to note here is that although the cost minimization in (P) is with respect to $w$,
the dependency on $b$ is hidden in this formulation. The explicit dependency on $b$
becomes more evident in its dual formulation described in the following.
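The quantities in (2.4), (2.5) and (2.7) are easy to compute directly. The sketch below evaluates the distance between the two class boundaries and checks the hard-margin constraints on a small toy data set. The vector `w`, the bias `b`, the data points and the helper names are illustrative assumptions.

```python
import math

def norm(v):
    return math.sqrt(sum(t * t for t in v))

def hyperplane_distance(w, c1, c2):
    """Distance between <w,x> + c1 = 0 and <w,x> + c2 = 0, as in (2.4)."""
    return abs(c1 - c2) / norm(w)

def margin(w):
    """Margin between <w,x> + b = 1 and <w,x> + b = -1, as in (2.5)."""
    return 2.0 / norm(w)

def is_feasible(w, b, data):
    """Check the hard-margin constraints (2.7): 1 - y*(<w,x> + b) <= 0."""
    return all(1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 1e-12
               for x, y in data)

w, b = [3.0, 4.0], 0.5          # ||w|| = 5
data = [([1.0, 1.0], 1), ([-1.0, -1.0], -1), ([0.2, 0.0], 1)]

# The boundaries of S_1 and S_{-1} are <w,x> + (b - 1) = 0 and <w,x> + (b + 1) = 0.
d = hyperplane_distance(w, b - 1.0, b + 1.0)
feasible = is_feasible(w, b, data)
# Halving (w, b) doubles the margin but may violate the constraints.
feasible_half = is_feasible([1.5, 2.0], 0.25, data)
print(d, margin(w), feasible, feasible_half)
```

The last check illustrates the trade-off behind (P): shrinking $\lVert w\rVert$ widens the margin only as long as every training point still satisfies (2.7).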


2.2.2. Dual Formulation 


The optimization problem (P) is a constrained optimization problem under inequality
constraints. A standard method for the constrained optimization problem is to
use the Lagrangian dual formulation [6]. In the following, we formally define a
Lagrangian dual problem.

Definition 2.1 [6] Suppose that a primal problem is given by

$$\min_x f_0(x)$$
$$\text{subject to} \quad f_i(x) \le 0, \ i = 1, \cdots, n, \tag{2.8}$$
$$h_i(x) = 0, \ i = 1, \cdots, p. \tag{2.9}$$

Then, the associated Lagrangian dual problem is defined by

$$\max_{\alpha, \nu} g(\alpha, \nu) \tag{2.10}$$
$$\text{subject to} \quad \alpha \ge 0, \tag{2.11}$$

where $\alpha = [\alpha_1, \cdots, \alpha_n]$ and $\nu = [\nu_1, \cdots, \nu_p]$ are referred to as the dual variables or
Lagrangian multipliers, $\alpha \ge 0$ implies that each element is non-negative, and the
Lagrangian $g(\alpha, \nu)$ is defined by

$$g(\alpha, \nu) := \inf_x \left\{ f_0(x) + \sum_{i=1}^n \alpha_i f_i(x) + \sum_{j=1}^p \nu_j h_j(x) \right\}. \tag{2.12}$$

One of the important findings in convex optimization theory [6] is that if the
primal problem is convex (and a constraint qualification such as Slater's condition
holds), then we have the following strong duality:

$$g(\alpha^*, \nu^*) = f_0(x^*), \tag{2.13}$$

where $x^*$ and $\alpha^*, \nu^*$ are the optimal solutions for the primal and dual problems,
respectively. Often, the dual formulation is easier to solve than the primal problem.
Additionally, there is also an interesting geometric interpretation.


Our binary classification problem (P) in (2.6) is a convex optimization problem
with respect to $w \in \mathbb{R}^d$, since both the objective function and the constraint sets are
convex. Therefore, using Definition 2.1, the original problem can be converted to a
dual problem:

$$(D) \quad \max_\alpha g(\alpha) \quad \text{subject to} \quad \alpha \ge 0,$$

where $\alpha = [\alpha_1, \cdots, \alpha_n]$ is a dual variable with respect to the primal variables $w$ and
$b$, and

$$g(\alpha) = \min_{w, b} \left\{ \frac{1}{2} \lVert w\rVert^2 + \sum_{i=1}^n \alpha_i \left(1 - y_i(\langle w, x_i\rangle + b)\right) \right\}. \tag{2.14}$$

At the minimizers of (2.14), the derivatives with respect to $w$ and $b$ should be zero,
which leads to the following first-order necessary conditions (FONC):

$$w = \sum_{i=1}^n \alpha_i y_i x_i, \quad \sum_{i=1}^n \alpha_i y_i = 0. \tag{2.15}$$

The FONCs in Eq. (2.15) have very important geometric interpretations. For
example, the first equation in (2.15) clearly shows how the normal vector for the
hyperplanes can be constructed using the dual variables. The second equation leads
to the balancing conditions. These will be explained in more detail later.

By plugging these FONCs into (2.14), the dual problem (D) becomes

$$\max_\alpha \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle \tag{2.16}$$
$$\text{subject to} \quad \sum_{i=1}^n \alpha_i y_i = 0, \quad \alpha_i \ge 0, \ \forall i.$$

Let $w^*$, $b^*$ and $\alpha^*$ denote the solutions for the primal and dual problems. Then, the
resulting binary classifier is given by

$$y \leftarrow \mathrm{sign}(\langle w^*, x\rangle + b^*) \tag{2.17}$$

for the case of the primal formulation, or

$$y \leftarrow \mathrm{sign}\left( \sum_{i=1}^n \alpha_i^* y_i \langle x_i, x\rangle + b^* \right) \tag{2.18}$$

for the case of the dual formulation, where $\mathrm{sign}(x)$ denotes the sign of $x$.
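For the smallest possible training set, one point per class, the dual (2.16) can be solved by hand: the balancing condition forces both multipliers to be equal, and the objective becomes the one-dimensional quadratic $2\alpha - \frac{1}{2}\alpha^2 \lVert x_+ - x_-\rVert^2$. The sketch below implements this closed form and compares the primal classifier (2.17) with the dual classifier (2.18). The function names and the toy points are illustrative assumptions, and ties at zero are arbitrarily assigned to the class $+1$.

```python
def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

def two_point_svm(x_pos, x_neg):
    """Closed-form solution of the dual (2.16) for one point per class."""
    diff = [p - q for p, q in zip(x_pos, x_neg)]
    alpha = 2.0 / dot(diff, diff)    # maximizer of 2*alpha - 0.5*alpha^2*||diff||^2
    w = [alpha * t for t in diff]    # first FONC in (2.15)
    b = 1.0 - dot(w, x_pos)          # x_pos lies on the boundary <w,x> + b = 1
    return alpha, w, b

def classify_primal(w, b, x):
    """Classifier (2.17)."""
    return 1 if dot(w, x) + b >= 0 else -1

def classify_dual(alphas, ys, xs, b, x):
    """Classifier (2.18), which only needs inner products with training data."""
    s = sum(a * y * dot(xi, x) for a, y, xi in zip(alphas, ys, xs)) + b
    return 1 if s >= 0 else -1

x_pos, x_neg = [1.0, 1.0], [-1.0, -1.0]
alpha, w, b = two_point_svm(x_pos, x_neg)

queries = [[2.0, 0.5], [-0.3, -2.0]]
primal_labels = [classify_primal(w, b, q) for q in queries]
dual_labels = [classify_dual([alpha, alpha], [1, -1], [x_pos, x_neg], b, q)
               for q in queries]
print(alpha, w, b, primal_labels, dual_labels)
```

In this example the margin $2/\lVert w\rVert$ equals the distance between the two training points, as expected when both points are support vectors.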
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2.2.3 KKT Conditions and Support Vectors 


To achieve the strong duality in (2.13), the so-called Karush-Kuhn-Tucker (KKT)
conditions should be satisfied [6]. More details on the KKT conditions can be found
in the standard convex optimization textbook [6], so here we briefly introduce the
core condition that is directly related to geometric understanding of the maximum
margin linear classifier.

More specifically, suppose that $x^*$ and $\alpha^*, \nu^*$ denote the optimal solutions for
the primal and dual problems, respectively. Then, we have

$$g(\alpha^*, \nu^*) = f_0(x^*) + \sum_{i=1}^n \alpha_i^* f_i(x^*) + \sum_{j=1}^p \nu_j^* h_j(x^*) = f_0(x^*) + \sum_{i=1}^n \alpha_i^* f_i(x^*), \tag{2.19}$$

where the last equality comes from the constraint $h_j(x^*) = 0$ in the primal problem.
In order to make (2.19) equal to $f_0(x^*)$, which corresponds to the strong duality
(2.13), the following condition should be satisfied:

$$\alpha_i^* > 0 \Rightarrow f_i(x^*) = 0 \quad \text{or, equivalently,} \quad f_i(x^*) < 0 \Rightarrow \alpha_i^* = 0. \tag{2.20}$$

This is the key KKT condition.
If (2.20) is applied to our classifier design problem, we have

$$\alpha_i^* > 0 \Rightarrow y_i(\langle w^*, x_i\rangle + b) = 1, \tag{2.21}$$

which implies that in constructing the normal vector direction $w^*$ of the hyperplane
using (2.15), only the training data at the class boundaries contribute:

$$w^* = \sum_{i=1}^n \alpha_i^* y_i x_i = \sum_{i \in I^+} \alpha_i^* x_i - \sum_{i \in I^-} \alpha_i^* x_i, \tag{2.22}$$

where $I^+$ and $I^-$ are index sets such that

$$I^+ = \{i \in [1, \cdots, n] \mid \langle w^*, x_i\rangle + b = 1\}, \tag{2.23}$$
$$I^- = \{i \in [1, \cdots, n] \mid \langle w^*, x_i\rangle + b = -1\}. \tag{2.24}$$

On the other hand, for the case of the training data $x_i$ inside the class boundaries,
$y_i(\langle w, x_i\rangle + b) > 1$. Therefore, the corresponding Lagrangian variable $\alpha_i$ becomes
zero. This situation is illustrated in Fig. 2.3. Here, the training data $x_i$ with
$i \in I^+$ or $i \in I^-$ are often called the support vectors, which is why the corresponding
classifier is often called the support vector machine (SVM) [10].

Finally, the second equation in (2.15) leads to an additional geometric relationship
between nonzero dual variables:

$$\sum_{i \in I^+} \alpha_i^* = \sum_{i \in I^-} \alpha_i^*,$$

which states the balancing condition between dual variables. In other words, the
weighting parameters for the support vectors should be balanced for each class
boundary.
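The conditions (2.15), (2.21) and the balancing condition can be verified on a toy example. In the sketch below, the data set and the dual solution `alpha` are illustrative assumptions: the values were obtained by hand for this particular data set and are simply plugged in, not computed by a solver.

```python
def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

X = [[1.0, 1.0], [-1.0, -1.0], [3.0, 2.0], [-2.0, -4.0]]
y = [1, -1, 1, -1]
alpha = [0.25, 0.25, 0.0, 0.0]   # assumed dual solution for this toy set
b = 0.0

# Normal vector from the first FONC in (2.15).
w = [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X)) for k in range(2)]

# Functional margins y_i(<w, x_i> + b); they must all be >= 1 for feasibility.
margins = [yi * (dot(w, xi) + b) for xi, yi in zip(X, y)]

# Support vectors are the points with a nonzero multiplier.
support = [i for i, a in enumerate(alpha) if a > 0]

# Key KKT condition (2.20)-(2.21): alpha_i * (1 - margin_i) = 0 for every i.
slack_ok = all(abs(a * (1.0 - m)) < 1e-12 for a, m in zip(alpha, margins))

# Balancing condition: the multipliers of the two classes sum to the same value.
pos_sum = sum(a for a, yi in zip(alpha, y) if yi == 1)
neg_sum = sum(a for a, yi in zip(alpha, y) if yi == -1)
print(w, margins, support, slack_ok, pos_sum, neg_sum)
```

The two points with margin exactly 1 are the support vectors, while the points farther inside their class regions receive zero weight and do not influence $w^*$.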


2.3 Soft-Margin Linear Classifiers 
2.3.1 Maximum Margin Classifier with Noise 


As shown in Fig. 2.2b, many practical classification problems contain data
sets that cannot be perfectly separated by a hyperplane. When the two classes are
not linearly separable (e.g., due to noise), the condition for the optimal hyperplane
can be relaxed by including extra terms:

$$y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \ \forall i, \tag{2.25}$$

where $\xi_i$ are often called the slack variables. The role of the slack variables is to
allow errors in the classification. Then, the optimization goal is to find the classifier
with the maximum margin with the minimum errors as shown in Fig. 2.4.

Fig. 2.4 Geometric structure of soft-margin linear support vector machine classifier

The corresponding primal problem is then given by

$$(P') \quad \min_{w, \xi} \frac{1}{2} \lVert w\rVert^2 + C \sum_{i=1}^n \xi_i$$
$$\text{subject to} \quad 1 - y_i(\langle w, x_i\rangle + b) \le \xi_i, \tag{2.26}$$
$$-\xi_i \le 0, \quad \forall i,$$

where the optimization problem again has implicit dependency on the bias term $b$.
The following theorem shows that the corresponding dual problem has a form very
similar to the hard-margin classifier in (2.16) with the exception of the differences
in the constraint for the dual variables.


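Theorem 2.1 The Lagrangian dual formulation of the primal problem in (2.26) is
given by

$$\max_\alpha \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle \tag{2.27}$$
$$\text{subject to} \quad \sum_{i=1}^n \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \ \forall i.$$

Proof For the given primal problem in (2.26), the corresponding Lagrangian dual
is given by

$$\max_{\alpha, \gamma} g(\alpha, \gamma) \quad \text{subject to} \quad \alpha \ge 0, \ \gamma \ge 0, \tag{2.28}$$

where

$$g(\alpha, \gamma) = \min_{w, b, \xi} \left\{ \frac{1}{2} \lVert w\rVert^2 + C \sum_{i=1}^n \xi_i + \sum_{i=1}^n \alpha_i \left(1 - y_i(\langle w, x_i\rangle + b) - \xi_i\right) - \sum_{i=1}^n \gamma_i \xi_i \right\}. \tag{2.29}$$

The first-order necessary conditions (FONCs) with respect to $w$, $b$ and $\xi$ lead to the
following equations:

$$w = \sum_{i=1}^n \alpha_i y_i x_i \tag{2.30}$$

and

$$\sum_{i=1}^n \alpha_i y_i = 0, \quad \alpha_i + \gamma_i = C. \tag{2.31}$$

By plugging (2.30) and (2.31) into Eq. (2.29), we have

$$g(\alpha, \gamma) = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle,$$

where $0 \le \alpha_i \le C$, since $\gamma_i = C - \alpha_i \ge 0$. This concludes the proof.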


Another way of representing the primal problem in (2.26) is using the so-called
hinge loss [10, 11]:

$$\ell_{hinge}(y, \hat{y}) = \max\{0, 1 - y\hat{y}\}, \tag{2.32}$$

of which a pictorial description is given in Fig. 2.5. Specifically, we define the slack
variable:

$$\xi_i = 1 - y_i(\langle w, x_i\rangle + b).$$

To make the slack variable represent the classification error for the data set $(x_i, y_i)$
within the class boundary, $\xi_i$ should be zero when the data is already well classified,
but positive when there exists a classification error. This leads to the following
definition of the slack variable:

$$\xi_i = \max\{0, 1 - y_i(\langle w, x_i\rangle + b)\} = \ell_{hinge}(y_i, \langle w, x_i\rangle + b). \tag{2.33}$$

Then, the primal problem in (2.26) can be represented by

$$\min_w \frac{1}{2} \lVert w\rVert^2 + C \sum_{i=1}^n \ell_{hinge}(y_i, \langle w, x_i\rangle + b). \tag{2.34}$$

Fig. 2.5 Pictorial description of hinge loss $\ell_{hinge}(y, \hat{y}) = \max\{0, 1 - y\hat{y}\}$

Later, we will show that this representation is closely related to the so-called
representer theorem [11].
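Because (2.34) is an unconstrained problem, it can be attacked directly with a subgradient method. The sketch below is a minimal illustration under stated assumptions: a constant step size, full-batch updates, a small separable toy data set, and hypothetical helper names. It is not the solver used in the text, and practical SVM packages rely on more refined algorithms.

```python
def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

def hinge(y, y_hat):
    """Hinge loss (2.32)."""
    return max(0.0, 1.0 - y * y_hat)

def objective(w, b, X, y, C):
    """Soft-margin primal in hinge-loss form (2.34)."""
    return 0.5 * dot(w, w) + C * sum(hinge(yi, dot(w, xi) + b) for xi, yi in zip(X, y))

def subgradient_step(w, b, X, y, C, lr):
    """One full-batch subgradient step on (2.34)."""
    gw, gb = list(w), 0.0                    # gradient of the 0.5*||w||^2 term
    for xi, yi in zip(X, y):
        if yi * (dot(w, xi) + b) < 1.0:      # this hinge term is active
            gw = [g - C * yi * t for g, t in zip(gw, xi)]
            gb -= C * yi
    return [wi - lr * g for wi, g in zip(w, gw)], b - lr * gb

X = [[2.0, 2.0], [3.0, 1.0], [2.0, 3.0], [-2.0, -2.0], [-1.0, -3.0], [-3.0, -2.0]]
y = [1, 1, 1, -1, -1, -1]
C, lr = 1.0, 0.01

w, b = [0.0, 0.0], 0.0
start = objective(w, b, X, y, C)
for _ in range(200):
    w, b = subgradient_step(w, b, X, y, C, lr)
end = objective(w, b, X, y, C)
accuracy = sum((1 if dot(w, xi) + b >= 0 else -1) == yi
               for xi, yi in zip(X, y)) / len(y)
print(start, end, accuracy)
```

With a constant step size the iterates oscillate around the minimizer without converging exactly, so a decaying step size would be the natural refinement.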


2.4 Nonlinear Classifier Using Kernel SVM 


2.4.1 Linear Classifier in the Feature Space 


Now consider a classification problem in $\mathbb{R}^2$ as shown in Figs. 2.6 or 2.2c, where
there exists no linear hyperplane that can separate two classes. Specifically, the data
in class 1 are within an ellipse:

$$S_1 = \{x = (x_1, x_2) \mid (x_1 + x_2)^2 + x_2^2 \le 2\}, \tag{2.35}$$

whereas class 2 data are located outside of the ellipse. This implies that although
the two classes of data cannot be separated by a single hyperplane, the nonlinear
boundary in (2.35) can separate the two classes.

Interestingly, the existence of the nonlinear boundary implies that we can find
the corresponding linear hyperplane in the higher-dimensional space. Specifically,
suppose we have a nonlinear mapping $\varphi : x = [x_1, x_2]^\top \mapsto \varphi(x)$ to the feature
space in $\mathbb{R}^3$ such that

$$\varphi(x) = [\varphi_1, \varphi_2, \varphi_3]^\top = [x_1^2, x_2^2, \sqrt{2} x_1 x_2]^\top. \tag{2.36}$$

Then, we can easily see that $S_1$ can be represented in the feature space by

$$S_1 = \{(\varphi_1, \varphi_2, \varphi_3) \mid \varphi_1 + 2\varphi_2 + \sqrt{2}\varphi_3 \le 2\}. \tag{2.37}$$

Fig. 2.6 Lifting to a high-dimensional feature space for linear classifier design

Therefore, there exists a linear classifier in $\mathbb{R}^3$ using the feature space mapping $\varphi(x)$
as shown in Fig. 2.6.
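The equivalence between (2.35) and (2.37) follows from expanding $(x_1 + x_2)^2 + x_2^2 = x_1^2 + 2x_2^2 + 2x_1x_2$, and it can also be confirmed numerically. The sketch below compares the quadratic form in the input space with the linear form in the feature space on random points; the function names and sampling range are illustrative assumptions.

```python
import math
import random

def phi(x):
    """Feature map (2.36): (x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2.0) * x1 * x2)

def ellipse_value(x):
    """Left-hand side of the nonlinear test (2.35) in the input space."""
    x1, x2 = x
    return (x1 + x2) ** 2 + x2 ** 2

def linear_value(p):
    """Left-hand side of the linear test (2.37) in the feature space."""
    return p[0] + 2.0 * p[1] + math.sqrt(2.0) * p[2]

def in_ellipse(x):
    return ellipse_value(x) <= 2.0

def in_half_space(p):
    return linear_value(p) <= 2.0

rng = random.Random(0)
points = [(rng.uniform(-3, 3), rng.uniform(-3, 3)) for _ in range(1000)]
max_gap = max(abs(ellipse_value(x) - linear_value(phi(x))) for x in points)

inside_pt, outside_pt = (0.5, 0.5), (2.0, 1.0)
print(max_gap,
      in_ellipse(inside_pt), in_half_space(phi(inside_pt)),
      in_ellipse(outside_pt), in_half_space(phi(outside_pt)))
```

The two expressions agree up to floating-point rounding, so the curved boundary in $\mathbb{R}^2$ is exactly a plane in the lifted space $\mathbb{R}^3$.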

In general, to allow the existence of a linear classifier, the feature space should
be in a higher-dimensional space than the ambient input space. In this sense, the
feature mapping $\varphi(x)$ works as a lifting operation that lifts up the dimension of the
data to a higher-dimensional one. In the lifted feature space by the feature mapping
$\varphi(x)$, the binary classifier design problem in (2.27) can be defined as

$$\max_\alpha \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \langle \varphi(x_i), \varphi(x_j)\rangle \tag{2.38}$$
$$\text{subject to} \quad \sum_{i=1}^n \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \ \forall i.$$

By extending (2.18) from the linear classifier, the associated nonlinear classifier
with respect to the optimization problem (2.38) can be similarly defined by

$$y \leftarrow \mathrm{sign}\left( \sum_{i=1}^n \alpha_i^* y_i \langle \varphi(x_i), \varphi(x)\rangle + b \right), \tag{2.39}$$

where $\alpha^*$ and $b$ are the solutions for the dual problem.


2.4.2. Kernel Trick 


Although (2.38) and (2.39) are nice generalizations of (2.27) and (2.18), there exist
several technical issues. One of the most critical issues is that for the existence of a
linear classifier, the lifting operation may require a very-high-dimensional or even
infinite-dimensional feature space. Therefore, an explicit calculation of the feature
vector $\varphi(x)$ may be computationally intensive or not possible.

The so-called kernel trick may overcome this technical issue by bypassing the
explicit construction of the lifting operation [11]. Specifically, as shown in (2.38)
and (2.39), all we need for the calculation of the linear classifier is the inner product
between the two feature vectors. Specifically, if we define the kernel function
$K : \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$ as follows:

$$K(x, x') := \langle \varphi(x), \varphi(x')\rangle, \tag{2.40}$$

then (2.38) and (2.39) can be converted to

$$\max_\alpha \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j K(x_i, x_j) \tag{2.41}$$
$$\text{subject to} \quad \sum_{i=1}^n \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \ \forall i,$$

and the resulting classifier is

$$y \leftarrow \mathrm{sign}\left( \sum_{i=1}^n \alpha_i^* y_i K(x_i, x) + b \right). \tag{2.42}$$

For the example in (2.36), the corresponding kernel is given by

$$K(x, y) = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 = (\langle x, y\rangle)^2,$$

which corresponds to a polynomial function with degree 2. Therefore, the common
practice in SVM literature is to design the kernel directly rather than to obtain it
from the underlying feature mapping. The following are representative examples of
kernels that are often used in the kernel SVM.

• Polynomial kernel with degree exactly $p$: $K(x, y) = (x^\top y)^p$.
• Polynomial kernel with degree up to $p$: $K(x, y) = (x^\top y + 1)^p$.
• Radial basis function kernel with width $\sigma$: $K(x, y) = \exp(-\lVert x - y\rVert^2/(2\sigma^2))$.
• Sigmoid kernel: $K(x, y) = \tanh(\eta x^\top y + \nu)$.
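The kernels listed above are one-liners in code, and the identity between the degree-2 polynomial kernel and the explicit feature map (2.36) can be checked directly. The sketch below is a plain-Python illustration; the function names and the test vectors are assumptions made for this example.

```python
import math

def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

def poly_kernel_exact(x, y, p):
    """Polynomial kernel with degree exactly p."""
    return dot(x, y) ** p

def poly_kernel_up_to(x, y, p):
    """Polynomial kernel with degree up to p."""
    return (dot(x, y) + 1.0) ** p

def rbf_kernel(x, y, sigma):
    """Radial basis function kernel with width sigma."""
    sq_dist = sum((s - t) ** 2 for s, t in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, eta, nu):
    """Sigmoid kernel tanh(eta * <x, y> + nu)."""
    return math.tanh(eta * dot(x, y) + nu)

def phi(x):
    """Explicit feature map (2.36) for the degree-2 polynomial kernel."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2.0) * x1 * x2)

x, y = [1.0, 2.0], [3.0, -1.0]
lhs = poly_kernel_exact(x, y, 2)     # kernel evaluated in the input space
rhs = dot(phi(x), phi(y))            # inner product in the feature space
print(lhs, rhs, poly_kernel_up_to(x, y, 2), rbf_kernel(x, x, 1.0),
      sigmoid_kernel(x, y, 0.5, 0.0))
```

The kernel evaluation costs one inner product in $\mathbb{R}^2$, whereas the explicit route first builds two vectors in $\mathbb{R}^3$; for high-degree polynomial or RBF kernels that gap becomes the whole point of the kernel trick.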

However, care should be taken since not all kernels can be used for SVM. To
be a viable option, a kernel should originate from the feature space mapping $\varphi(x)$.
In fact, there exists an associated feature mapping if the kernel function satisfies
the so-called Mercer's condition [11]. The kernel that satisfies Mercer's condition
is often called the positive definite kernel. The details of Mercer's condition can be
found in the standard SVM literature [11] and will be explained later in the context
of the representer theorem.
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2.5 Classical Approaches for Image Classification 


Although the SVM and its kernel extension are beautiful convex optimization
frameworks devoid of local minimizers, there are fundamental challenges in using
these methods for image classification. In particular, the dimension of the ambient
space $\mathcal{X}$ should not be too large in the SVM, because the optimization
procedure is computationally expensive. Accordingly, one of the essential steps of
using the SVM framework is feature engineering, which pre-processes the input images
to obtain a much lower-dimensional vector $x \in \mathcal{X}$ that captures the
essential information of the input images. For example, a classical pipeline for the
image classification task can be summarized as follows (see Fig. 2.7):


• Process the data set to extract hand-crafted features based on some knowledge of
imaging physics, geometry, and other analytic tools,

• or extract features by feeding the data into a standard set of feature extractors
such as SIFT (the Scale-Invariant Feature Transform) [12] or SURF (the
Speeded-Up Robust Features) [13].

• Choose the kernels based on your domain expertise.

• Put the training data composed of hand-crafted features and labels into a kernel
SVM to learn a classifier.


Here, the main technical innovations usually come from the feature extraction,
often based on the serendipitous discoveries of lucky graduate students. Moreover, 
kernel selection also requires domain expertise that was previously the subject of 
extensive research. We will see later that one of the main innovations in the modern 
deep learning approach is that this hand-crafted feature engineering and kernel 
design are no longer required as they are automatically learned from the training 
data. This simplicity can be one of the main reasons for the success of deep learning, 
which led to the deluge of new deep tech companies. 

So far we have mainly discussed the binary classification problems. Note that 
more general forms of the classifiers beyond the binary classifier are of importance 
in practice: for example, ImageNet has more than 20,000 categories. The extension 
of the linear classifier for such a setup is important, but will be discussed later. 


[Figure: feature extraction (SIFT, SURF, HOG) → kernel SVM → “Dog”]

Fig. 2.7 Classical classifier design flowchart

2.6 Exercises


1. For a given polynomial kernel up to degree 2,

$$k(x, y) = (x^\top y + c)^2, \quad x, y \in \mathbb{R}^2,$$

what is the corresponding feature mapping $\phi(x)$ such that
$k(x, y) = \langle \phi(x), \phi(y) \rangle$?

2. Show that the feature space dimension for the radial basis function is infinite. 

3. Suppose we are given the following positively labeled data points: 


$$x_1 = [2, 1]^\top, \quad x_2 = [2, -1]^\top, \quad x_3 = [3, 1]^\top, \tag{2.43}$$

and the following negatively labeled data points:

$$x_4 = [1, 0]^\top, \quad x_5 = [0, 1]^\top, \quad x_6 = [0, -1]^\top. \tag{2.44}$$

a. Are the two classes linearly separable? Answer this question by visualizing their
distribution in $\mathbb{R}^2$.

b. Now, we are interested in designing a hard-margin linear SVM. What are 
the support vectors? Please answer this by inspection. You must give your 
reasoning. 

c. Using primal formulation, compute the closed form solution of the linear 
SVM classifier by hand calculation. You must show each step of your 
calculation. The inequality constraints may be simplified by exploiting the 
support vectors and KKT conditions. 

d. Using dual formulation, compute the closed form solution of the linear SVM 
classifier by hand calculation. You must show each step of your calculation. 
The inequality constraints may be simplified by exploiting the support vectors 
and KKT conditions. 


4. Suppose we are given the following positively labeled data points:

$$x_1 = [0.5, 0]^\top, \quad x_2 = [1.5, 1]^\top, \quad x_3 = [1.5, -1]^\top, \quad x_4 = [2, 0]^\top, \tag{2.45}$$

and the following negatively labeled data points:

$$x_5 = [1, 0]^\top, \quad x_6 = [0, 1]^\top, \quad x_7 = [0, -1]^\top, \quad x_8 = [-1, 0]^\top. \tag{2.46}$$

a. Are the two classes linearly separable? Answer this question by visualizing
their distribution in $\mathbb{R}^2$.

b. Now, we are interested in designing a soft-margin linear SVM. Using
MATLAB, plot the decision boundaries for various choices of $C$.

c. What do you observe when $C \to \infty$?

5. Suppose we are given the following positively labeled data points:

$$x_1 = [3, 3]^\top, \quad x_2 = [3, -3]^\top, \quad x_3 = [-3, -3]^\top, \quad x_4 = [-3, 3]^\top, \tag{2.47}$$

and the following negatively labeled data points:

$$x_5 = [1, 1]^\top, \quad x_6 = [1, -1]^\top, \quad x_7 = [-1, -1]^\top, \quad x_8 = [-1, 1]^\top. \tag{2.48}$$


a. Are the two classes linearly separable? Answer this question by visualizing
their distribution in $\mathbb{R}^2$.

b. Find a feature mapping $\phi : \mathbb{R}^2 \to \mathcal{F} \subset \mathbb{R}^2$ so that the two classes are linearly
separable in the feature space $\mathcal{F}$. Show this by drawing the data distribution in $\mathcal{F}$.

c. What is the corresponding kernel?

d. What are the support vectors in $\mathcal{F}$?

e. Using dual formulation, compute the closed form solution of a kernel SVM 
classifier by hand calculation. You must show each step of your calculation. 
The inequality constraints may be simplified by exploiting the support vectors 
and KKT conditions. 


Chapter 3
Linear, Logistic, and Kernel Regression


3.1 Introduction 


In machine learning, regression analysis refers to a process for estimating the
relationships between dependent variables and independent variables. This method
is mainly used to predict and to find cause-and-effect relationships between
variables. For example, in linear regression, a researcher tries to find the line
that best fits the data according to a certain mathematical criterion (see Fig. 3.1a).
Another important regression problem is logistic regression. For example, in
Fig. 3.1b, the dependent variables are binary properties, such as yes or no for a
given question, and the goal is to fit the binary data using continuously varying
independent variables. This problem is closely related to the binary classification
problem. For the case of Fig. 3.1c, the technical issue is a bit different from the
other two. Here, the data cannot be fitted by a straight line. Moreover, the
dependent variable is not binary, but takes continuous values. A better approach is
to fit the data with a smoothly varying curve, which leads to a nonlinear regression
problem.

Although regression analysis is a classical approach that can be dated back to 
the least squares method by Legendre in 1805 and by Gauss in 1809, regression 
analysis is still a key idea of the deep learning approaches, as will be discussed 
later. Therefore, we will visit the classical regression approach to discuss three 
specific forms of regression analysis: linear regression, logistic regression, and 
kernel regression. Later on, this overview will prove useful in understanding modern 
regression approaches using deep neural networks.

Fig. 3.1 Examples of various regression problems. The x-axes are for the independent variables,
and the y-axes are for the dependent variables. (a) Linear regression, (b) logistic regression, and (c)
nonlinear regression using a polynomial kernel


3.2 Linear Regression 


3.2.1 Ordinary Least Squares (OLS) 


A linear regression uses a linear model as shown in Fig. 3.1a. More specifically, 
the dependent variable can be calculated from a linear combination of the input 
variables. It is also common to refer to a linear model as Ordinary Least Squares 
(OLS) linear regression or just Least Squares (LS) regression. For example, a simple 
linear regression model is given by 


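$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, \cdots, n, \tag{3.1}$$

and the goal is to estimate the parameter set $\beta = \{\beta_0, \beta_1\}$ from the training data
$\{x_i, y_i\}_{i=1}^{n}$.

In general, a linear regression problem can be represented by

$$y_i = \langle x_i, \beta \rangle + \epsilon_i, \quad i = 1, \cdots, n, \tag{3.2}$$

where $(x_i, y_i) \in \mathbb{R}^p \times \mathbb{R}$ is the $i$-th training data, and $\beta \in \mathbb{R}^p$ is referred to as the
regression coefficient. This can be represented in matrix form as

$$y = X^\top \beta + \epsilon, \tag{3.3}$$

where

$$y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \quad X = \begin{bmatrix} x_1 & \cdots & x_n \end{bmatrix}, \quad \epsilon = \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{bmatrix}.$$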


In this mathematical formulation, $x_i$ corresponds to the independent variable,
whereas $y_i$ is the dependent variable.

Then, the regression analysis using the $\ell_2$ loss or the mean squared error (MSE) loss
can be done by

$$\min_{\beta} \ell(\beta), \quad \ell(\beta) = \frac{1}{2} \|y - X^\top \beta\|^2, \tag{3.4}$$

where the loss can be further expanded as

$$\begin{aligned}
\ell(\beta) &= \frac{1}{2} \|y - X^\top \beta\|^2 \\
&= \frac{1}{2} (y - X^\top \beta)^\top (y - X^\top \beta) \\
&= \frac{1}{2} \left( y^\top y - y^\top X^\top \beta - \beta^\top X y + \beta^\top X X^\top \beta \right).
\end{aligned}$$

The parameter that minimizes the MSE loss can be found by setting the gradient 
of the loss with respect to B to zero. To calculate the gradient for the vector-valued 
function, the following lemma is useful. 


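Lemma 3.1 [5] Let $x$, $a$ and $B$ denote vectors and a matrix with appropriate
sizes, respectively. Then, we have

$$\frac{\partial x^\top a}{\partial x} = \frac{\partial a^\top x}{\partial x} = a, \tag{3.5}$$

$$\frac{\partial x^\top B x}{\partial x} = (B + B^\top) x. \tag{3.6}$$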


Using Lemma 3.1, we have

$$\left. \frac{\partial \ell(\beta)}{\partial \beta} \right|_{\beta = \hat{\beta}} = -Xy + XX^\top \hat{\beta} = 0,$$

where $\hat{\beta}$ is the minimizer. If $XX^\top$ is invertible, or equivalently $X$ has full row rank, then we
have

$$\hat{\beta} = (XX^\top)^{-1} X y. \tag{3.7}$$

The full rank condition is important for the existence of the matrix inverse, and it
will be revisited in the ridge regression.
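As a minimal sketch of (3.7), consider the simple model (3.1), where each column of $X$ is $[1, x_i]^\top$ so that $XX^\top$ is a $2 \times 2$ matrix that can be inverted by hand. The function name `ols_line` and the toy data are assumptions made for illustration only.

```python
def ols_line(xs, ys):
    """Fit y = b0 + b1 * x by solving the normal equations (X X^T) beta = X y.

    Each column of X is [1, x_i]^T, hence
    X X^T = [[n, sum x], [sum x, sum x^2]] and X y = [sum y, sum x*y].
    """
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx  # zero exactly when X X^T is singular
    if det == 0:
        raise ValueError("X X^T is not invertible (X lacks full row rank)")
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1

# Toy data lying exactly on the line y = 1 + 2x (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
b0, b1 = ols_line(xs, ys)
```

The determinant check mirrors the full rank condition: if all $x_i$ coincide, the two rows of $X$ are linearly dependent and the inverse in (3.7) does not exist.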

This regression setup is closely related to the general linear model (GLM), which
has been successfully used for statistical analysis. For example, GLM analysis is
one of the main workhorses for functional MRI data analysis [14]. The main
idea of functional MRI is that multiple temporal frames of MR images of a brain
are obtained during a given task (for example, motion tasks), and then the temporal
variation of the MR values at each voxel location is analyzed to check whether it is
correlated with the given task. Here, the time series data $y$ from one voxel is
described as a linear combination of the model ($X^\top$, often termed the “design
matrix,” which contains a set of regressors representing the independent variables,
as in Fig. 3.2) and the residuals (i.e., the errors). The results are then stored,
displayed, and possibly analyzed further in the form of voxelwise maps, as shown in
the top right of Fig. 3.2 when $\beta = [\beta_1, \beta_2]$.

Fig. 3.2 General linear model (GLM) for functional MRI data analysis


3.3 Logistic Regression 
3.3.1 Logits and Linear Regression 


Similar to the example in Fig. 3.1b, there are many important problems for which
the dependent variable takes limited values. For example, in binary logistic regression
for analyzing smoking behavior, the dependent variable is a dummy variable: coded
0 (did not smoke) or 1 (did smoke). In another example, one is interested in fitting a
linear model to the probability of an event, in which case the dependent variable only
takes values between 0 and 1. For such problems, transforming the independent variables
does not remedy all of the potential problems. Instead, the key idea of logistic
regression is to transform the dependent variable.

Specifically, we define the term odds:

$$\text{odds} := \frac{q}{1 - q}, \tag{3.8}$$

where $q$ is a probability in the range 0 to 1. The odds have a range of 0 to $\infty$, with
values greater than 1 associated with an event being more likely to occur than
not to occur, and values less than 1 associated with an event that is less likely to occur.
Then, the term logit is defined as the log of the odds:

$$\text{logit} := \log(\text{odds}) = \log\left( \frac{q}{1 - q} \right).$$


This transformation is useful because it creates a variable with a range from $-\infty$ to
$\infty$, with zero associated with an event equally likely to occur and not to occur. One of
the important advantages of this transformation of the dependent variable is that it
solves the problem we encountered in fitting a linear model to probabilities. If we
transform our probabilities to logits, then the range of the logit is not restricted, so
that we can apply a standard linear regression.
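The odds and logit transformations can be sketched as follows. The function names and the sample probability are illustrative assumptions; `inverse_logit` is obtained simply by solving $\text{logit} = z$ for $q$, and is included only as a sanity check that the transformation loses no information.

```python
import math

def odds(q):
    """Odds of an event with probability q, for 0 <= q < 1."""
    return q / (1.0 - q)

def logit(q):
    """Log of the odds; maps (0, 1) onto the whole real line."""
    return math.log(odds(q))

def inverse_logit(z):
    """Solve log(q / (1 - q)) = z for q."""
    return 1.0 / (1.0 + math.exp(-z))

q = 0.8
z = logit(q)               # log(0.8 / 0.2) = log(4), an unrestricted real value
q_back = inverse_logit(z)  # recovers the original probability
```

Probabilities below 0.5 give negative logits, probabilities above 0.5 give positive logits, and $q = 0.5$ maps to zero, in line with the description above.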

Specifically, using the logit transform, a linear regression model for the
probability is given by

$$\log\left( \frac{q}{1 - q} \right) = \beta_0 + \beta_1 x, \tag{3.9}$$

from which we have

$$q = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} = \mathrm{Sig}(\beta_0 + \beta_1 x), \tag{3.10}$$

where $\mathrm{Sig}(x)$ denotes the sigmoid function:

$$\mathrm{Sig}(x) = \frac{1}{1 + e^{-x}},$$

whose shape is shown in Fig. 3.3. It is remarkable that although the nonlinear
transform is originally applied to the dependent variable for linear regression, the
net result is the introduction of a nonlinearity after the linear term. In fact, this is
closely related to modern deep neural networks, which have nonlinearities after the
linear layers.

Fig. 3.3 Sigmoid function $\mathrm{Sig}(x) = 1/(1 + e^{-x})$

Fig. 3.4 Multi-class classification problem

3.3.2 Multiclass Classification Using Logistic Regression


In SVM, we mainly discussed the binary classification problem, in which a
hyperplane is defined to separate two classes. Now, consider Fig. 3.4, where we
want to define three hyperplanes that can split the data into multiple categories.

A direct extension of the SVM for the multiple class classifier design problem is
to consider all the combinations of the hyperplanes. More specifically, a data point
$x_i$ can be on either side of each hyperplane, so that given three hyperplanes in
Fig. 3.4, one could design a classifier that can potentially classify $2^3 = 8$ classes.
Although this approach may reduce the number of hyperplanes for a given number
of classes $c$, one of the main technical difficulties of such an extension of SVM is that
we need to consider all combinations of the constraint sets, which is
difficult to implement.

A quick remedy for this multi-class classifier design problem is to use logistic
regression. More specifically, for given $c$ class categories, we define a probability
vector $q = [q_1, \cdots, q_c]^\top \in \mathbb{R}^c$, where $q_i \in [0, 1]$ denotes the probability that a
data point belongs to class $i$. Then, by extending (3.9) to vector-valued probabilities
for a given independent variable $x \in \mathbb{R}^p$, we have

$$\begin{bmatrix} \log\left( \frac{q_1}{1 - q_1} \right) \\ \vdots \\ \log\left( \frac{q_c}{1 - q_c} \right) \end{bmatrix} = W^\top x + b, \tag{3.11}$$

where $W \in \mathbb{R}^{p \times c}$ denotes the matrix composed of $c$ normal vectors in the
$p$-dimensional space, and $b \in \mathbb{R}^c$ is the associated bias term. Then, we can easily
see that the corresponding probability vector is given by

$$q = \mathrm{Sig}(W^\top x + b), \tag{3.12}$$

where $\mathrm{Sig}(\cdot)$ is an element-wise sigmoid function. Then, by ranking the magnitude
of the probabilities, one could classify the data into the corresponding categories.
In fact, this technique is a standard method in modern classifier design using deep
neural networks. We will revisit this issue later.
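A minimal sketch of (3.12) followed by the ranking step is given below. The parameter values for $W$ and $b$ are hypothetical (they are not learned from any data), and the helper names `class_probabilities` and `classify` are assumptions made for illustration.

```python
import math

def sig(z):
    """Scalar sigmoid function."""
    return 1.0 / (1.0 + math.exp(-z))

def class_probabilities(W, b, x):
    """Return q = Sig(W^T x + b); W is stored as p rows with c entries each."""
    p, c = len(W), len(b)
    logits = [sum(W[k][j] * x[k] for k in range(p)) + b[j] for j in range(c)]
    return [sig(z) for z in logits]

def classify(W, b, x):
    """Pick the class whose probability is the largest."""
    q = class_probabilities(W, b, x)
    return max(range(len(q)), key=lambda j: q[j])

# Hypothetical parameters for p = 2 features and c = 3 classes.
W = [[1.0, -1.0, 0.0],
     [0.0, 0.0, 1.0]]
b = [0.0, 0.0, -0.5]
x = [2.0, 0.5]

q = class_probabilities(W, b, x)   # logits are 2, -2 and 0
label = classify(W, b, x)
```

Since the sigmoid is monotonically increasing, ranking the probabilities is the same as ranking the entries of $W^\top x + b$, so the class with the largest logit is selected.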


3.4 Ridge Regression 


Recall that the basic assumption for the linear regression solution in (3.7) is that
$X^\top$ has full column rank, or equivalently $X$ has full row rank. However, when $X^\top$ is
high-dimensional, the columns of $X^\top$ can be collinear, which in statistical terms refers to
the event of two (or multiple) covariates being highly linearly related. Consequently,
$X^\top$ may not have full column rank, or may be close to rank deficient,
and we cannot use the standard linear regression. To deal with this issue, the ridge
regression is useful.
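Specifically, the following regularized least squares problem is solved:

$$\min_{\beta} \ell_{\text{ridge}}(\beta),$$

where

$$\ell_{\text{ridge}}(\beta) = \frac{1}{2} \|y - X^\top \beta\|^2 + \frac{\lambda}{2} \|\beta\|^2, \tag{3.13}$$

and $\lambda > 0$ is the regularization parameter. This type of regularization is often
called the Tikhonov regularization. Using Lemma 3.1, we can easily show

$$\left. \frac{\partial \ell_{\text{ridge}}(\beta)}{\partial \beta} \right|_{\beta = \hat{\beta}} = -Xy + XX^\top \hat{\beta} + \lambda \hat{\beta} = 0,$$

which leads to

$$\hat{\beta} = (XX^\top + \lambda I)^{-1} X y. \tag{3.14}$$

Using the following matrix inversion lemma [3],

$$(I + UCV)^{-1} = I - U (C^{-1} + VU)^{-1} V, \tag{3.15}$$

Eq. (3.14) can be equivalently written as

$$\begin{aligned}
\hat{\beta} &= (XX^\top + \lambda I)^{-1} X y \\
&= \frac{1}{\lambda} \left( XX^\top / \lambda + I \right)^{-1} X y \\
&= \frac{1}{\lambda} \left[ I - X (\lambda I + X^\top X)^{-1} X^\top \right] X y \\
&= \frac{1}{\lambda} X \left[ I - (\lambda I + X^\top X)^{-1} X^\top X \right] y \\
&= \frac{1}{\lambda} X (\lambda I + X^\top X)^{-1} \left[ (\lambda I + X^\top X) - X^\top X \right] y \\
&= X (X^\top X + \lambda I)^{-1} y.
\end{aligned} \tag{3.16}$$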


In particular, the expression in (3.16) is useful when X is a tall matrix, since the 
size of the matrix inversion is much smaller than that of (3.14). Even if this is 
not the case, the expression in (3.16) is extremely useful to derive the kernel ridge 
regression, which is the main topic in the next section. 
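The equivalence of (3.14) and (3.16) can be checked numerically with a small sketch. Everything below is illustrative: the data matrix, the value of $\lambda$, and the helper names (`solve`, `matmul`, `add_ridge`) are assumptions, and the linear systems are solved by a basic Gaussian elimination written only for this example.

```python
def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add_ridge(A, lam):
    """Return A + lam * I for a square matrix A."""
    n = len(A)
    return [[A[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

# X is p x n with p = 3 features and n = 2 samples, i.e. a tall matrix.
X = [[1.0, 2.0],
     [0.0, 1.0],
     [1.0, 0.0]]
y = [1.0, 2.0]
lam = 0.5
Xt = transpose(X)

# (3.14): beta = (X X^T + lam I)^{-1} X y requires a p x p solve.
Xy = [sum(X[i][j] * y[j] for j in range(2)) for i in range(3)]
beta_primal = solve(add_ridge(matmul(X, Xt), lam), Xy)

# (3.16): beta = X (X^T X + lam I)^{-1} y requires only an n x n solve.
z = solve(add_ridge(matmul(Xt, X), lam), y)
beta_dual = [sum(X[i][j] * z[j] for j in range(2)) for i in range(3)]
```

Here the first form solves a $3 \times 3$ system and the second a $2 \times 2$ system, yet both return the same coefficient vector up to rounding.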


3.5 Kernel Regression 


Recall that a nonlinear kernel SVM was developed based on the observation that 
the nonlinear decision boundary in the original input space can be often represented 
as a linear boundary in the high-dimensional feature space. A similar idea can be 
used for regression. Specifically, the goal is to implement the linear regression in 
the high-dimensional feature space, but the net result is that the resulting regression 
becomes nonlinear in the original space (see Fig. 3.5). 

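Fig. 3.5 Kernel regression concept

In order to use a kernel trick similar to that used in the kernel SVM, let us revisit
the linear regression problem in (3.2). Using the parameter estimation from the ridge
regression (3.16), the estimated function $\hat{f}(x)$ for a given independent variable
$x \in \mathbb{R}^p$ is given by

$$\begin{aligned}
\hat{f}(x) &:= x^\top \hat{\beta} \\
&= x^\top X (X^\top X + \lambda I)^{-1} y \\
&= \begin{bmatrix} \langle x, x_1 \rangle & \cdots & \langle x, x_n \rangle \end{bmatrix}
\left( \begin{bmatrix} \langle x_1, x_1 \rangle & \cdots & \langle x_1, x_n \rangle \\ \vdots & \ddots & \vdots \\ \langle x_n, x_1 \rangle & \cdots & \langle x_n, x_n \rangle \end{bmatrix} + \lambda I \right)^{-1} y,
\end{aligned} \tag{3.17}$$

where we use

$$x^\top X = x^\top \begin{bmatrix} x_1 & \cdots & x_n \end{bmatrix} = \begin{bmatrix} \langle x, x_1 \rangle & \cdots & \langle x, x_n \rangle \end{bmatrix}$$

and

$$X^\top X = \begin{bmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{bmatrix} \begin{bmatrix} x_1 & \cdots & x_n \end{bmatrix} = \begin{bmatrix} \langle x_1, x_1 \rangle & \cdots & \langle x_1, x_n \rangle \\ \vdots & \ddots & \vdots \\ \langle x_n, x_1 \rangle & \cdots & \langle x_n, x_n \rangle \end{bmatrix}.$$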


Since everything is represented by the inner product of the input vectors, we can
now lift the data $x$ to a feature space using $\phi(x)$ to compute the inner product in
the high-dimensional feature space. Then, using the kernel trick, the inner product
in the feature space can be replaced by the kernel:

$$\langle x, x_j \rangle \to k(x, x_j) := \langle \phi(x), \phi(x_j) \rangle. \tag{3.18}$$

Accordingly, (3.17) can be extended to the feature space as:

$$\hat{f}(x) = \begin{bmatrix} k(x, x_1) & \cdots & k(x, x_n) \end{bmatrix} (K + \lambda I)^{-1} y, \tag{3.19}$$

where $K \in \mathbb{R}^{n \times n}$ is the kernel Gram matrix given by

$$K := \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{bmatrix}. \tag{3.20}$$


Equivalently, (3.19) can be derived from the following regression problem with
kernel:

$$y_i = \sum_{j=1}^{n} \alpha_j k(x_i, x_j) + \epsilon_i, \tag{3.21}$$

which is a nonlinear extension of (3.2). Then, (3.19) is obtained using the following
optimization problem:

$$\min_{\alpha} \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{n} \alpha_j k(x_i, x_j) \right)^2 + \lambda \alpha^\top K \alpha, \tag{3.22}$$

where $K$ is the kernel Gram matrix in (3.20). This implies that the regularization
term should be weighted by the kernel to take into account the deformation in
the feature space. A more rigorous derivation of (3.22) is obtained from the so-called
representer theorem, which is the topic of the next chapter.

Figure 3.6 shows the examples of linear regression and kernel regression using 
the polynomial and radial basis function (RBF) kernels. We can clearly see that 
nonlinear kernel regression follows the trend much better. 
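A small sketch of the kernel regression estimate (3.19) with an RBF kernel is given below. The one-dimensional data (samples of a sine curve), the kernel width, the regularization parameter, and the helper names are all assumptions chosen for illustration; the linear system is solved by a basic Gaussian elimination written for this example.

```python
import math

def rbf(x, z, sigma=1.0):
    """RBF kernel for scalar inputs."""
    return math.exp(-(x - z) ** 2 / (2.0 * sigma ** 2))

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def fit_kernel_ridge(xs, ys, lam, kernel):
    """Return alpha = (K + lam I)^{-1} y, with K the kernel Gram matrix."""
    n = len(xs)
    A = [[kernel(xs[i], xs[j]) + (lam if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return solve(A, ys)

def predict(x, xs, alpha, kernel):
    """Evaluate f(x) = [k(x, x_1) ... k(x, x_n)] alpha, as in (3.19)."""
    return sum(a * kernel(x, xi) for a, xi in zip(alpha, xs))

# Illustrative nonlinear data: samples of a sine curve.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [math.sin(x) for x in xs]
alpha = fit_kernel_ridge(xs, ys, lam=1e-6, kernel=rbf)
fitted = [predict(x, xs, alpha, rbf) for x in xs]
```

With a very small $\lambda$ the fitted values nearly interpolate the training samples, while a larger $\lambda$ gives a smoother curve that departs further from them.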


[Figure: data fitted by linear regression, polynomial kernel regression, and RBF kernel regression]

Fig. 3.6 Linear and nonlinear kernel regression

3.6 Bias-Variance Trade-off in Regression


In this section, we will discuss the important issue of the bias and variance trade-off 
in regression analysis. 

Let $\{\mathbf{x}_i, \mathbf{y}_i\}_{i=1}^{n}$ denote the training data set, where $\mathbf{x}_i \in \mathbb{R}^p \subset \mathcal{X}$ is an
independent variable and $\mathbf{y}_i \in \mathbb{R}^p \subset \mathcal{Y}$ is a dependent variable that depends
on $\mathbf{x}_i$. The reason we use the boldface characters $\mathbf{x}_i$ and $\mathbf{y}_i$ is that they can be
vectors. In regression analysis, the dependent variable is often represented as a
functional relationship with respect to the independent variable:

$$\mathbf{y}_i = \mathbf{f}_{\Theta}(\mathbf{x}_i) + \boldsymbol{\epsilon}_i, \tag{3.23}$$

where $\boldsymbol{\epsilon}_i$ denotes an additive error term that may stand in for unmodeled parts, and
$\mathbf{f}_{\Theta}(\cdot)$ is a regression function (which can possibly be a nonlinear function) with the
input variable $\mathbf{x}_i$ and parameterized by $\Theta$. With a slight abuse of notation, we often
use $\mathbf{f} := \mathbf{f}_{\Theta}$ when the dependency on the parameter $\Theta$ is obvious.

In (3.23), $\Theta$ is the regression parameter set that should be estimated from the
training data set. Usually, this parameter set is estimated by minimizing a loss. For
example, one of the most popular loss functions is the $\ell_2$ or MSE loss, in which case
the parameter estimation problem is given by

$$\min_{\Theta} \frac{1}{2} \sum_{i=1}^{n} \|\mathbf{y}_i - \mathbf{f}_{\Theta}(\mathbf{x}_i)\|^2. \tag{3.24}$$

Another popular tool that is often used in regression analysis is regularization. In
regularized regression analysis, an additional term is added to impose a constraint
on the parameter. More specifically, the following optimization problem is solved to
estimate the parameter $\Theta$:

$$\min_{\Theta} \frac{1}{2} \sum_{i=1}^{n} \|\mathbf{y}_i - \mathbf{f}_{\Theta}(\mathbf{x}_i)\|^2 + \lambda R(\Theta), \tag{3.25}$$

where $R(\Theta)$ and $\lambda$ are often called the regularization function and regularization
parameter, respectively.
With the estimated parameter $\hat{\Theta}$, the estimated function $\hat{\mathbf{f}}$ is defined as

$$\hat{\mathbf{f}}(\mathbf{x}) := \mathbf{f}_{\hat{\Theta}}(\mathbf{x}). \tag{3.26}$$


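Suppose that the noise $\boldsymbol{\epsilon}$ is zero-mean i.i.d. Gaussian with variance $\sigma^2$. Then, the
MSE of the regression problem is given by

$$\begin{aligned}
E\|\mathbf{y} - \hat{\mathbf{f}}\|^2 &= E\|\mathbf{f} + \boldsymbol{\epsilon} - \hat{\mathbf{f}}\|^2 \\
&= E\|\mathbf{f} + \boldsymbol{\epsilon} - \hat{\mathbf{f}} + E[\hat{\mathbf{f}}] - E[\hat{\mathbf{f}}]\|^2 \\
&= E\|\mathbf{f} - E[\hat{\mathbf{f}}]\|^2 + E\|\hat{\mathbf{f}} - E[\hat{\mathbf{f}}]\|^2 + E\|\boldsymbol{\epsilon}\|^2 \\
&= \|\mathbf{f} - E[\hat{\mathbf{f}}]\|^2 + E\|\hat{\mathbf{f}} - E[\hat{\mathbf{f}}]\|^2 + E\|\boldsymbol{\epsilon}\|^2 \\
&= \|\mathrm{Bias}(\hat{\mathbf{f}})\|^2 + \mathrm{Var}(\hat{\mathbf{f}}) + p\sigma^2,
\end{aligned} \tag{3.27}$$

where we use the following for the third equality:

$$E[\boldsymbol{\epsilon}^\top (\mathbf{f} - E[\hat{\mathbf{f}}])] = 0, \quad
E[\boldsymbol{\epsilon}^\top (\hat{\mathbf{f}} - E[\hat{\mathbf{f}}])] = 0, \quad
E[(\mathbf{f} - E[\hat{\mathbf{f}}])^\top (\hat{\mathbf{f}} - E[\hat{\mathbf{f}}])] = 0,$$

and the fourth equality comes from the fact that $\mathbf{f}$ and $E[\hat{\mathbf{f}}]$ are deterministic.
Equation (3.27) clearly shows that the MSE of the prediction error
is composed of bias and variance components. This leads to the so-called
bias-variance trade-off in regression problems, which is explained in detail in the
following example.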


3.6.1 Examples 


Here, we will investigate the bias and variance trade-off for the linear regression 
problem, where the regression function is given by 


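$$f(x) = \langle x, \beta \rangle = x^\top \beta. \tag{3.28}$$

Using the expectation operator $E[\cdot]$, the bias and variance of the OLS in (3.7)
can be computed as follows:

$$\begin{aligned}
\mathrm{Bias}(\hat{f}) &:= x^\top \beta - E[x^\top \hat{\beta}] \\
&= x^\top \beta - x^\top E[(XX^\top)^{-1} X y] \\
&= x^\top \beta - x^\top (XX^\top)^{-1} X E[y] \\
&= x^\top \beta - x^\top (XX^\top)^{-1} X X^\top \beta = 0,
\end{aligned}$$

since $E[y] = E[X^\top \beta + \epsilon] = X^\top \beta + E[\epsilon] = X^\top \beta$. Since the bias is zero, it is
often called an unbiased estimator. Similarly, the variance can be computed by

$$\begin{aligned}
\mathrm{Var}(\hat{f}) &= E\left[ x^\top (\hat{\beta} - \beta)(\hat{\beta} - \beta)^\top x \right] \\
&= E\left[ x^\top (XX^\top)^{-1} X \epsilon \epsilon^\top X^\top (XX^\top)^{-1} x \right] \\
&= x^\top (XX^\top)^{-1} X E\left[ \epsilon \epsilon^\top \right] X^\top (XX^\top)^{-1} x \\
&= \sigma^2 x^\top (XX^\top)^{-1} x.
\end{aligned}$$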

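On the other hand, the bias and variance of the ridge regression in (3.14) are
given by

$$\begin{aligned}
\mathrm{Bias}(\hat{f}) &:= x^\top \beta - E[x^\top (XX^\top + \lambda I)^{-1} X y] \\
&= x^\top \left( I - (XX^\top + \lambda I)^{-1} XX^\top \right) \beta \\
&= \lambda x^\top (XX^\top + \lambda I)^{-1} \beta,
\end{aligned}$$

and

$$\begin{aligned}
\mathrm{Var}(\hat{f}) &= E\left[ x^\top (XX^\top + \lambda I)^{-1} X \epsilon \epsilon^\top X^\top (XX^\top + \lambda I)^{-1} x \right] \\
&= \sigma^2 x^\top (XX^\top + \lambda I)^{-1} XX^\top (XX^\top + \lambda I)^{-1} x,
\end{aligned} \tag{3.29}$$

where we use $E\left[ \epsilon \epsilon^\top \right] = \sigma^2 I$.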

Accordingly, we can see that as $\lambda$ becomes larger, the variance decreases and
the bias increases as shown in Fig. 3.7. This implies that the bias-variance trade-off
of a ridge regression depends on the regularization parameter. One could find the
optimal parameter $\lambda^*$ that leads to the minimal total prediction error, which gives
the best bias-variance trade-off. The search for this optimal hyperparameter is one
of the important research topics in classical ridge regression problems.
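The trade-off can be made concrete in the scalar case $p = 1$, where $XX^\top = \sum_i x_i^2 =: s$ is a number, so the ridge bias becomes $\lambda x \beta / (s + \lambda)$ and the variance in (3.29) becomes $\sigma^2 x^2 s / (s + \lambda)^2$. The sketch below evaluates these two expressions for a few values of $\lambda$; the numerical values of $x$, $\beta$, $s$ and $\sigma^2$ are arbitrary assumptions, and the function name is hypothetical.

```python
def ridge_bias_variance(x, beta, s, sigma2, lam):
    """Bias and variance of the ridge prediction at x for p = 1.

    s = X X^T = sum_i x_i^2 is a scalar when p = 1.
    """
    bias = lam * x * beta / (s + lam)
    var = sigma2 * x * x * s / (s + lam) ** 2
    return bias, var

# Arbitrary illustrative values (not from the text).
x, beta, s, sigma2 = 1.0, 2.0, 10.0, 1.0
lams = [0.0, 0.25, 1.0, 10.0, 100.0]

results = [ridge_bias_variance(x, beta, s, sigma2, lam) for lam in lams]
bias_sq = [b * b for b, _ in results]
variances = [v for _, v in results]
total = [b2 + v for b2, v in zip(bias_sq, variances)]
```

In this toy setting the squared bias grows and the variance shrinks as $\lambda$ increases, and their sum is smaller at $\lambda = 0.25$ than at $\lambda = 0$ or $\lambda = 1$. For $p = 1$, setting the derivative of the sum to zero gives $\lambda^* = \sigma^2 / \beta^2$, which equals 0.25 for the values assumed here.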


Fig. 3.7 Bias-variance trade-off in ridge regression (curves: total error, Bias($\hat{f}$), and Var($\hat{f}$))

3.7 Exercises


1. Prove the matrix inversion lemma in Eq. (3.15). 
2. The blood pressures, y (mmHg), and the ages, x years, of 7 patients are shown 
in the following table: 


Patient id 1 2 3 4 2 6 7 
57 
125 


a. Obtain the OLS estimate of blood pressure with respect to age. 
b. Plot the regression line on the scatter plots. 


3. A mechanical part is tested under various temperature conditions. The table below
summarizes observational data on the part for 10 trials, where all other
experimental conditions are the same except for the temperature (shown in degrees).
Damaged represents the number of damaged parts, and Undamaged represents
the number of parts that were not damaged.


Trial id 1 2 3 4 5 6 7 8 9 10 
Temperature 

Damaged 5 1 1 1 0 0 0 0 0 1 
Undamaged 7 6 5 6 8 8 7 6 5 6 


a. Write down the logistic regression model. 
b. What is the estimated failure probability for a given temperature T? 


4. Show that the ridge regression in (3.14) is equivalent to the linear regression with 
the following augmented dependent and independent variables: 


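$$\tilde{y} = \begin{bmatrix} y \\ 0 \end{bmatrix}, \quad \tilde{X} = \begin{bmatrix} X & \sqrt{\lambda} I \end{bmatrix},$$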


where I is the p x p identity matrix. 
5. Consider the regression problem in the following table, where x is the indepen- 
dent variable and y is the dependent variable. 


[a [2 [32 _[a_[55_[e7 [me [a9 [100 [50_[ 7194 


y | 2330 | 2750 | 2309 | 2500 | 2100 | 1120 | 1010 | 1640 | 1931 | 1705 | 1751 | 2002

a. Perform the linear regression. What is the remaining residual error?
b. Consider the following Gaussian kernel:


1 1(x-x; 2 
Kova) = eee 5 ( A P| 


c. Perform the kernel regression with h = 5, 10 and 15. What do you observe? 


6. By directly solving (3.22), derive the kernel regression in (3.17). 
7. Show that the variance of the ridge regression in (3.29) increases with
decreasing regularization parameter $\lambda$.


Chapter 4
Reproducing Kernel Hilbert Space,
Representer Theorem


4.1 Introduction 


One of the key concepts in machine learning is the feature space, which is often
referred to as the latent space. A feature space is usually a higher- or
lower-dimensional space than the original one where the input data lie (which is often
referred to as the ambient space). Recall that in the kernel SVM, by lifting the data to
a higher-dimensional feature space, one can find a linear classifier that can separate
two different classes of samples (see Fig. 4.1a). Similarly, in kernel regression,
rather than searching for nonlinear functions that can fit the data in the ambient
space, the main idea is to compute a linear regressor in a higher-dimensional feature
space as shown in Fig. 4.1b. On the other hand, in principal component analysis
(PCA), the input signals are projected onto a lower-dimensional feature space using
the singular value decomposition (see Fig. 4.1c).

In this section, we formally define a feature space that has good mathematical 
properties. Here, the “good” mathematical properties refer to the well-defined 
structure such as existence of the inner product, the completeness, reproducing 
properties, etc. In fact, the feature space with these properties is often called the 
reproducing kernel Hilbert space (RKHS) [11]. Although the RKHS is only a small 
subset of the Hilbert space, its mathematical properties are highly versatile, which 
makes the algorithm development simpler. 

The RKHS theory has wide applications, including complex analysis, harmonic 
analysis, and quantum mechanics. Reproducing kernel Hilbert spaces are particu- 
larly important in the field of machine learning theory because of the celebrated 
representer theorem [11, 15] which states that every function in an RKHS that 
minimizes an empirical risk functional can be written as a linear combination of the 
kernel function evaluated at training samples. Indeed, the representer theorem has 
played a key role in classical machine learning problems, since it provides a means 
to reduce infinite dimensional optimization problems to tractable finite-dimensional 
ones.

Fig. 4.1 Examples of feature space embedding in (a) kernel SVM, (b) kernel regression, and (c)
principal component analysis


In this chapter, we review the RKHS theory and the representer theorem. Then, 
we revisit the classifier and regression problems to show how kernel SVM and 
regression can be derived from the representer theorem. Then, we discuss the 
limitation of the kernel machines. Later we will show how these limitations of kernel 
machines can be largely overcome by modern deep learning approaches. 


4.2 Reproducing Kernel Hilbert Space (RKHS) 


As the theory of the RKHS originates from core mathematics, the rigorous definition
is very abstract and is often difficult to understand for students working on
machine learning applications. Therefore, this section tries to explain the concept
from a more machine learning perspective so that students can understand why the
RKHS theory has been the main workhorse in classical machine learning theory.

[Figure: nested sets, from innermost to outermost: reproducing kernel Hilbert space, inner product space, normed space, vector space]

Fig. 4.2 RKHS, Hilbert space, Banach space, and vector space

Before diving into details, the readers are reminded that the RKHS is only a 
subset of the Hilbert space as shown in Fig. 4.2, i.e. the Hilbert space is more 
general than the RKHS. For the formal definition of the Hilbert space, please refer 
to Chap. 1. 


4.2.1 Feature Map and Kernels 


Here we start with the formal definition of a kernel:

Definition 4.1 Let $\mathcal{X}$ be a non-empty set. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a
kernel if there exists a Hilbert space $\mathcal{H}$ and a feature map $\phi : \mathcal{X} \to \mathcal{H}$ such that
$\forall x, x' \in \mathcal{X}$:

$$k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}. \tag{4.1}$$

For example, a feature mapping we used to explain the kernel SVM was

$$\phi(x) = [\phi_1, \phi_2, \phi_3]^\top = [x_1^2, \; x_2^2, \; \sqrt{2} x_1 x_2]^\top, \tag{4.2}$$

where $\mathcal{X} = \mathbb{R}^2$ (see Fig. 4.1a). We also showed that the corresponding kernel is
given by

$$\begin{aligned}
k(x, y) &= \langle \phi(x), \phi(y) \rangle \\
&= x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 \\
&= (\langle x, y \rangle)^2,
\end{aligned}$$

for all $x = [x_1 \; x_2]^\top$, $y = [y_1 \; y_2]^\top \in \mathbb{R}^2$, which corresponds to a polynomial kernel
with degree 2. Note also that the feature space can be infinite-dimensional, such as
$\ell^2(\mathbb{Z})$. In this case, using the definition of the inner product in $\ell^2(\mathbb{Z})$ (see (1.5)), the
kernel is defined as

$$k(x, x') = \sum_{l=-\infty}^{\infty} \phi_l(x) \phi_l(x'),$$

where $\phi = \{\phi_l\}_{l=-\infty}^{\infty} \in \mathcal{H}$.

Here it is important to emphasize that there are almost no conditions on $\mathcal{X}$,
i.e., $\mathcal{X}$ does not need an inner product, etc. On the other hand, the feature space $\mathcal{H}$
should be a Hilbert space. This implies that the feature map imposes a mathematical
structure on a data set that does not necessarily have one.
This is an important machine learning apparatus, as it provides a versatile tool to
set the mathematical structure for all data in practice. For example, the
bag-of-words (BOW) kernel [16] used for document classification
imposes a mathematical structure on unstructured data such as documents
(see Fig. 4.3).

Fig. 4.3 Bag-of-words embedding to the feature space

Example (Bag-of-Words Kernel)

Suppose that the $l$-th element of the feature mapping $\phi(x)$ for a document
$x$ denotes the number of times the $l$-th word (from a dictionary) appears in the
document $x$. If we want to classify documents by their word counts, we can
use the kernel $k(x, y) = \langle \phi(x), \phi(y) \rangle$.
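The bag-of-words kernel can be sketched as follows. The dictionary, the example documents, and the whitespace-based tokenization are assumptions made for illustration; a practical system would use a much larger dictionary and a more careful tokenizer.

```python
from collections import Counter

def bow_features(document, dictionary):
    """Feature map: count how often each dictionary word appears in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in dictionary]

def bow_kernel(doc_x, doc_y, dictionary):
    """k(x, y) = <phi(x), phi(y)> computed on the word-count vectors."""
    phi_x = bow_features(doc_x, dictionary)
    phi_y = bow_features(doc_y, dictionary)
    return sum(a * b for a, b in zip(phi_x, phi_y))

# Hypothetical dictionary and documents.
dictionary = ["kernel", "space", "dog", "cat"]
doc_a = "the kernel maps data to a feature space and the kernel is positive"
doc_b = "a dog and a cat"
doc_c = "kernel methods use a Hilbert space"

k_ac = bow_kernel(doc_a, doc_c, dictionary)   # shared words: kernel, space
k_ab = bow_kernel(doc_a, doc_b, dictionary)   # no shared dictionary words
```

Note that the documents themselves have no vector structure; only the word-count feature map gives them an inner product, so documents with overlapping vocabulary obtain a larger kernel value.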


In the kernel SVM and/or kernel regression, the optimization problem for the
design of a classifier and/or regressor is formulated using kernels without ever
using the feature map. Then, if we are given a function of two arguments, $k(x, x')$,
how can we determine whether it is a valid kernel? To answer this question, we need to
check whether there exists a valid feature map. For this, the concept of positive
definiteness is important.


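Definition 4.2 A symmetric function $k : \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$ is positive definite if $\forall n \ge 1$,
$\forall (a_1, \cdots, a_n) \in \mathbb{R}^n$, $\forall (x_1, \cdots, x_n) \in \mathcal{X}^n$,

\[
\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j k(x_i, x_j) \ge 0. \tag{4.3}
\]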


Although this condition is both necessary and sufficient, the forward direction is 
more intuitive in understanding why the kernel function should be positive definite. 
More specifically, if we define the kernel as in (4.1), we have 


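\[
\begin{aligned}
\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j k(x_i, x_j)
&= \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}} \\
&= \Big\| \sum_{i=1}^{n} a_i \phi(x_i) \Big\|_{\mathcal{H}}^2 \ge 0.
\end{aligned}
\]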


Therefore, existence of the feature mapping guarantees the positive definiteness of 
the kernel. 
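
The forward direction can also be checked numerically. The sketch below, which assumes NumPy and uses randomly drawn points and coefficients purely for illustration, compares the quadratic form in (4.3) with the squared norm of $\sum_i a_i \phi(x_i)$ for the degree-2 polynomial kernel.

```python
import numpy as np

def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel on R^2."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

def k(x, y):
    """Degree-2 polynomial kernel k(x, y) = (<x, y>)^2."""
    return float(np.dot(x, y)) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))   # six arbitrary points x_1, ..., x_6 in R^2
a = rng.normal(size=6)        # arbitrary real coefficients a_1, ..., a_6

K = np.array([[k(xi, xj) for xj in X] for xi in X])   # Gram matrix
quad_form = a @ K @ a                                   # left-hand side of (4.3)
feature_norm_sq = np.linalg.norm(sum(ai * phi(xi) for ai, xi in zip(a, X))) ** 2
min_eig = np.linalg.eigvalsh(K).min()   # nonnegative up to rounding error

print(quad_form, feature_norm_sq, min_eig)
```

Since the feature space here is three-dimensional, the $6 \times 6$ Gram matrix has rank at most 3, so its smallest eigenvalue is zero up to rounding error rather than strictly positive.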


4.2.2 Definition of RKHS 


With the definition of kernels and feature mapping, we are now ready to define the 
reproducing kernel Hilbert space. Toward this goal, let us revisit the feature mapping 
we used to explain the kernel SVM: 


\[
\phi(x) = [\phi_1(x), \phi_2(x), \phi_3(x)]^\top = [x_1^2, \; x_2^2, \; \sqrt{2}\, x_1 x_2]^\top.
\]

Suppose we define a function $f : \mathcal{X} \mapsto \mathbb{R}$ via a feature map:

\[
f(x) = \sum_{l=1}^{3} f_l \phi_l(x) = f_1 x_1^2 + f_2 x_2^2 + f_3 \sqrt{2}\, x_1 x_2.
\]

In terms of feature space coordinates, $f$ is represented by $f(\cdot)$:

\[
f = f(\cdot) = [f_1 \; f_2 \; f_3]^\top,
\]

so that $f(x)$ can be represented as an inner product:

\[
f(x) = \langle f(\cdot), \phi(x) \rangle_{\mathcal{H}}, \tag{4.4}
\]

where the feature map $\phi(x)$ is often called the point evaluation function at $x$ in the
RKHS literature.

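Now, the key ingredient of the RKHS is that rather than considering all of the
Hilbert space $\mathcal{H}$, we consider its subset $\mathcal{H}_\phi$ (recall Fig. 4.2) that is generated by the
evaluation function $\phi$. More specifically, for all $f(\cdot) \in \mathcal{H}_\phi$ there exists a set $\{x_i\}_{i=1}^{n}$,
$x_i \in \mathcal{X}$ such that

\[
f(\cdot) = \sum_{i=1}^{n} \alpha_i \phi(x_i). \tag{4.5}
\]

This is equivalent to saying that $\mathcal{H}_\phi$ is a linear span of $\{\phi(x) : x \in \mathcal{X}\}$. Then, by
plugging (4.5) into (4.4), we have

\[
\begin{aligned}
f(x) &= \langle f(\cdot), \phi(x) \rangle_{\mathcal{H}} \\
&= \sum_{i=1}^{n} \alpha_i \langle \phi(x_i), \phi(x) \rangle_{\mathcal{H}} \\
&= \sum_{i=1}^{n} \alpha_i k(x_i, x).
\end{aligned} \tag{4.6}
\]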


As a special case, we can easily see that the coordinate of a kernel in the feature
space, $k(x', \cdot)$ for a given $x' \in \mathcal{X}$, lives in an RKHS $\mathcal{H}_\phi$, since we have

\[
k(x', x) = \langle k(x', \cdot), \phi(x) \rangle_{\mathcal{H}} = \langle \phi(x'), \phi(x) \rangle, \tag{4.7}
\]

where the last equality comes from the definition of a kernel. Therefore, we can see
that

\[
k(x', \cdot) = \phi(x'), \tag{4.8}
\]

which corresponds to (4.5) with $n = 1$. Accordingly, we can write a kernel in terms
of an inner product in the underlying Hilbert space:

\[
k(x, x') = \langle k(x, \cdot), k(x', \cdot) \rangle_{\mathcal{H}}. \tag{4.9}
\]

Furthermore, we can write (4.4) as follows:

\[
f(x) = \langle f(\cdot), k(x, \cdot) \rangle_{\mathcal{H}}, \tag{4.10}
\]

which is known as the reproducing property [11].
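
As a sanity check of (4.4)-(4.10), the sketch below evaluates a function in the span of three feature vectors in two ways: through its feature space coordinate and through kernel evaluations alone. The centers and coefficients are arbitrary illustrative values, and the degree-2 polynomial kernel is used only because its feature map is explicit.

```python
import numpy as np

def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel on R^2."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

def k(x, y):
    """k(x, y) = (<x, y>)^2 = <phi(x), phi(y)>."""
    return float(np.dot(x, y)) ** 2

centers = np.array([[1.0, 0.0], [0.5, -1.0], [2.0, 1.0]])   # hypothetical x_i
alpha = np.array([0.3, -1.2, 0.7])                           # hypothetical alpha_i

# Feature space coordinate f(.) = sum_i alpha_i phi(x_i), as in (4.5).
f_coord = sum(a_i * phi(x_i) for a_i, x_i in zip(alpha, centers))

x = np.array([-0.4, 1.5])
via_features = float(f_coord @ phi(x))                                   # (4.4)
via_kernel = sum(a_i * k(x_i, x) for a_i, x_i in zip(alpha, centers))    # (4.6)

print(via_features, via_kernel)   # the two evaluations agree
```

The second evaluation never touches $\phi$, which is the practical content of the reproducing property.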

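As such, for all $f(\cdot), g(\cdot) \in \mathcal{H}_\phi$ we can show that there exist $\{\alpha_i\}_{i=1}^{n}$ and $\{\beta_i\}_{i=1}^{n}$
such that $f(\cdot) = \sum_{i=1}^{n} \alpha_i k(x_i, \cdot)$ and $g(\cdot) = \sum_{i=1}^{n} \beta_i k(x_i', \cdot)$, since $\phi(x) = k(x, \cdot)$.
Therefore, we often interchangeably use $\mathcal{H}_k$ to denote $\mathcal{H}_\phi$ if the kernel $k(x, x')$ is
specified. This leads to the explicit representation of their inner product:

\[
\begin{aligned}
\langle f, g \rangle_{\mathcal{H}} &= \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \beta_j \langle k(x_i, \cdot), k(x_j', \cdot) \rangle_{\mathcal{H}} && (4.11) \\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \beta_j k(x_i, x_j'). && (4.12)
\end{aligned}
\]

The induced norm is then defined by

\[
\| f \|_{\mathcal{H}} = \sqrt{\langle f, f \rangle_{\mathcal{H}}} = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j k(x_i, x_j)}. \tag{4.13}
\]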


By summarizing these findings, we are ready to provide an intuitive definition of 
RKHS. 


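Definition 4.3 Let $k : \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$ be a positive definite kernel. The RKHS, $\mathcal{H}_\phi$,
generated by the kernel $k$, is a linear span of $\{k(x, \cdot) : x \in \mathcal{X}\}$ equipped with the
inner product

\[
\langle f, g \rangle_{\mathcal{H}} = \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \beta_j k(x_i, x_j'), \tag{4.14}
\]

where $f(\cdot) = \sum_{i=1}^{n} \alpha_i k(x_i, \cdot)$ and $g(\cdot) = \sum_{i=1}^{n} \beta_i k(x_i', \cdot)$.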
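
The inner product (4.14) and the norm (4.13) need only kernel evaluations. The following sketch, with hypothetical centers and coefficients, computes both from Gram matrices and cross-checks them against the explicit three-dimensional feature space of the degree-2 polynomial kernel.

```python
import numpy as np

def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel on R^2."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

def gram(A, B):
    """Cross Gram matrix with entries k(a_i, b_j) = (<a_i, b_j>)^2."""
    return (A @ B.T) ** 2

# Hypothetical expansions f = sum_i alpha_i k(x_i, .), g = sum_i beta_i k(x'_i, .).
Xf = np.array([[1.0, 0.0], [0.5, -1.0], [2.0, 1.0]])
alpha = np.array([0.3, -1.2, 0.7])
Xg = np.array([[-1.0, 1.0], [0.0, 3.0], [1.5, 0.5]])
beta = np.array([0.7, 0.1, -0.4])

inner_fg = alpha @ gram(Xf, Xg) @ beta           # inner product, (4.14)
norm_f = np.sqrt(alpha @ gram(Xf, Xf) @ alpha)   # induced norm, (4.13)

# Cross-check in the explicit feature space.
f_coord = sum(a * phi(x) for a, x in zip(alpha, Xf))
g_coord = sum(b * phi(x) for b, x in zip(beta, Xg))
print(inner_fg, f_coord @ g_coord)
print(norm_f, np.linalg.norm(f_coord))
```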
From the (classical) machine learning perspective, the most important reason to
use the RKHS is Eq. (4.5), which states that the feature space coordinate of the target
function lies in the linear span of $\{k(x, \cdot) : x \in \mathcal{X}\}$ or, equivalently, $\{\phi(x) : x \in
\mathcal{X}\}$. This implies that as long as we have a sufficient number of training data, we can
estimate the target function by estimating its feature space coordinates.

In fact, one of the important breakthroughs of the modern neural network
approach is to relax the assumption that the feature space coordinate of the target
function should lie in a linear span. This issue will be discussed in detail later.


4.3 Representer Theorem 


Given the definition of kernels and the RKHS, the representer theorem is a simple 
consequence. Recall that in machine learning problems, the loss is defined as the 
error energy between the actual target and the estimated one. For example, in the 
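linear regression problem, the MSE loss for the given training data $\{x_i, y_i\}_{i=1}^{n}$ is
defined by

\[
\ell_2\big(\{x_i, y_i, f(x_i)\}_{i=1}^{n}\big) = \sum_{i=1}^{n} \| y_i - f(x_i) \|^2, \tag{4.15}
\]

where

\[
f(x_i) = \langle x_i, \beta \rangle,
\]

with $\beta$ being the unknown parameter to estimate. In the soft-margin SVM, the loss
is given by the hinge loss:

\[
\ell_{\mathrm{hinge}}\big(\{x_i, y_i, f(x_i)\}_{i=1}^{n}\big) = \sum_{i=1}^{n} \max\{0, 1 - y_i f(x_i)\}, \tag{4.16}
\]

where

\[
f(x_i) = \langle w, x_i \rangle + b,
\]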


with w and b denoting the parameters to estimate. For the general loss function, the 
celebrated representer theorem is given as follows: 


Theorem 4.1 [11, 15] Consider a positive definite real-valued kernel $k : \mathcal{X} \times \mathcal{X} \mapsto
\mathbb{R}$ on a non-empty set $\mathcal{X}$ with the corresponding RKHS $\mathcal{H}_k$. Let there be given
training data set $\{x_i, y_i\}_{i=1}^{n}$ with $x_i \in \mathcal{X}$ and $y_i \in \mathbb{R}$ and a strictly increasing
real-valued regularization function $R : [0, \infty) \mapsto \mathbb{R}$. Then, for arbitrary loss function
$\ell\big(\{x_i, y_i, f(x_i)\}_{i=1}^{n}\big)$, any minimizer for the following optimization problem:

\[
f^* = \arg\min_{f \in \mathcal{H}_k} \; \ell\big(\{x_i, y_i, f(x_i)\}_{i=1}^{n}\big) + R(\| f \|_{\mathcal{H}}) \tag{4.17}
\]

admits a representation of the form

\[
f^*(\cdot) = \sum_{i=1}^{n} \alpha_i k(x_i, \cdot) = \sum_{i=1}^{n} \alpha_i \phi(x_i) \tag{4.18}
\]

for some $\alpha_i \in \mathbb{R}$, $i = 1, \cdots, n$; or it is equivalently represented by

\[
f^*(x) = \sum_{i=1}^{n} \alpha_i k(x_i, x). \tag{4.19}
\]


The proof of the representer theorem can easily be found in standard machine
learning textbooks [11], so we do not revisit it here. Instead, we briefly touch upon the
main idea of the proof, since it also highlights the limitations of kernel machines.
Specifically, the feature space coordinate of the minimizer $f^*$, denoted by $f^*(\cdot)$,
can be decomposed into a linear combination of the feature maps from the
training data $\{\phi(x_i)\}_{i=1}^{n}$ and a component in their orthogonal complement. But when
we perform the point evaluation with $\{\phi(x_i)\}_{i=1}^{n}$ using the inner product during the
training phase, the contribution from the orthogonal complement disappears from the
loss, while it can only increase the norm penalized by the strictly increasing
regularizer. The minimizer therefore has no such component, which leads to the
final form in (4.18).
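
The main idea of the proof can be illustrated numerically. In the sketch below (an illustration under assumed toy values, not a proof), the feature space is $\mathbb{R}^3$, there are two training points, and a component orthogonal to both training feature vectors is added to a candidate solution. The training predictions do not change, whereas the norm grows.

```python
import numpy as np

def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel on R^2."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

# Two hypothetical training inputs and their feature vectors in R^3.
p1, p2 = phi(np.array([1.0, 0.5])), phi(np.array([-0.5, 2.0]))

f_par = 0.8 * p1 - 0.3 * p2   # component in span{phi(x_1), phi(x_2)}
v = np.cross(p1, p2)          # direction orthogonal to both training features
f = f_par + 1.5 * v           # candidate with an orthogonal component added

train_f = np.array([f @ p1, f @ p2])            # point evaluations of f
train_par = np.array([f_par @ p1, f_par @ p2])  # point evaluations of f_par

print(train_f, train_par)                        # identical on the training set
print(np.linalg.norm(f), np.linalg.norm(f_par))  # but f has the larger norm
```

Any loss that depends on $f$ only through the training predictions is therefore blind to the orthogonal component, and a strictly increasing regularizer removes it.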


4.4 Application of Representer Theorem 


In this section, we revisit the kernel SVM and regression to show how the 
representer theorem can simplify the derivation. 


4.4.1 Kernel Ridge Regression 


Recall that the ridge regression was given by the following optimization problem:

\[
\min_{\beta} \sum_{i=1}^{n} \| y_i - \langle x_i, \beta \rangle \|^2 + \lambda \| \beta \|^2.
\]

By extending this to a nonparametric form, the kernel ridge regression is given by
the following minimization problem:

\[
\min_{f \in \mathcal{H}_k} \sum_{i=1}^{n} \| y_i - f(x_i) \|^2 + \lambda \| f \|_{\mathcal{H}}^2, \tag{4.20}
\]

where $\mathcal{H}_k$ is the RKHS with the positive definite kernel $k$. From Theorem 4.1, we
know that the minimizer should have the form

\[
f(\cdot) = \sum_{j=1}^{n} \alpha_j \phi(x_j). \tag{4.21}
\]
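Using (4.4), the MSE loss becomes

\[
\begin{aligned}
\sum_{i=1}^{n} \| y_i - f(x_i) \|^2
&= \sum_{i=1}^{n} \| y_i - \langle f(\cdot), \phi(x_i) \rangle \|^2 \\
&= \sum_{i=1}^{n} \Big\| y_i - \sum_{j=1}^{n} \alpha_j \langle \phi(x_j), \phi(x_i) \rangle \Big\|^2 \\
&= \sum_{i=1}^{n} \Big\| y_i - \sum_{j=1}^{n} \alpha_j k(x_j, x_i) \Big\|^2 \\
&= \| y - K \alpha \|^2,
\end{aligned}
\]

where $K \in \mathbb{R}^{n \times n}$ denotes the kernel Gram matrix given by

\[
K = \begin{bmatrix}
k(x_1, x_1) & \cdots & k(x_1, x_n) \\
\vdots & \ddots & \vdots \\
k(x_n, x_1) & \cdots & k(x_n, x_n)
\end{bmatrix} \tag{4.22}
\]

and

\[
y = [y_1 \cdots y_n]^\top, \quad \alpha = [\alpha_1 \cdots \alpha_n]^\top. \tag{4.23}
\]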
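Similarly, the regularization term becomes

\[
\begin{aligned}
\| f \|_{\mathcal{H}}^2 &= \langle f(\cdot), f(\cdot) \rangle \\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j \langle \phi(x_i), \phi(x_j) \rangle \\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j k(x_i, x_j) \\
&= \alpha^\top K \alpha.
\end{aligned}
\]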


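Therefore, (4.20) can be equivalently represented by the finite dimensional
optimization problem:

\[
\hat{\alpha} := \arg\min_{\alpha \in \mathbb{R}^n} \| y - K \alpha \|^2 + \lambda \alpha^\top K \alpha. \tag{4.24}
\]

The problem is convex; so using the first order necessary condition, we have

\[
(K^2 + \lambda K) \alpha = K y,
\]

where we use $K^\top = K$ due to the symmetry of the Gram matrix. If $K$ is invertible
(which is usually the case for the standard choice of kernels), we have

\[
\hat{\alpha} = (K + \lambda I)^{-1} y.
\]

Finally, using (4.4) and (4.21) we have

\[
\begin{aligned}
f(x) &= \langle f(\cdot), \phi(x) \rangle \\
&= \sum_{i=1}^{n} \hat{\alpha}_i \langle \phi(x_i), \phi(x) \rangle \\
&= [k(x_1, x) \; \cdots \; k(x_n, x)] \, (K + \lambda I)^{-1} y,
\end{aligned}
\]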

which is what we obtained before. 
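
The derivation above translates almost line by line into code. The following is a minimal sketch assuming NumPy and an RBF kernel; the function names, the noise-free sine data, and the values of $\lambda$ and $\sigma$ are illustrative choices rather than part of the text.

```python
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    """Gram matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def krr_fit(X, y, lam, sigma=1.0):
    """Solve (K + lam I) alpha = y, the closed form following (4.24)."""
    K = rbf_gram(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, sigma=1.0):
    """f(x) = [k(x_1, x) ... k(x_n, x)] alpha for each row x of X_new."""
    return rbf_gram(X_new, X_train, sigma) @ alpha

# Toy 1-D regression data (noise-free samples of sin, chosen for illustration).
X = np.linspace(-3.0, 3.0, 30)[:, None]
y = np.sin(X[:, 0])
lam = 1e-2

alpha = krr_fit(X, y, lam)
pred = krr_predict(X, alpha, np.array([[0.5]]))
print(pred[0], np.sin(0.5))   # prediction versus the underlying function
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly; only the $n$ coefficients $\hat{\alpha}$ and the training inputs are needed at prediction time.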


4.4.2 Kernel SVM 

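Recall that the soft-margin SVM formulation (without bias) can be represented by

\[
\min_{w} \frac{1}{2} \| w \|^2 + C \sum_{i=1}^{n} \ell_{\mathrm{hinge}}(y_i, \langle w, x_i \rangle), \tag{4.25}
\]

where $\ell_{\mathrm{hinge}}$ is the hinge loss

\[
\ell_{\mathrm{hinge}}(y, \hat{y}) = \max\{0, 1 - y \hat{y}\}. \tag{4.26}
\]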
This problem can be solved using the representer theorem. Specifically, an extended
formulation of (4.25) in the RKHS is given by

\[
\min_{f \in \mathcal{H}_k} \frac{1}{2} \| f \|_{\mathcal{H}}^2 + C \sum_{i=1}^{n} \ell_{\mathrm{hinge}}(y_i, f(x_i)), \tag{4.27}
\]

whose minimizer $f$ has the following coordinate in the feature space:

\[
f(\cdot) = \sum_{j=1}^{n} \alpha_j k(x_j, \cdot). \tag{4.28}
\]

Using this, the hinge loss term becomes

\[
\ell_{\mathrm{hinge}}(y_i, f(x_i)) = \max\Big\{0, 1 - y_i \sum_{j=1}^{n} \alpha_j k(x_j, x_i)\Big\}. \tag{4.29}
\]

Similarly, the regularization term becomes

\[
\| f \|_{\mathcal{H}}^2 = \alpha^\top K \alpha,
\]


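where $K$ is the kernel Gram matrix in (4.22). Now, (4.27) can be represented in a
constrained form

\[
\begin{aligned}
\min_{\alpha, \xi} \quad & \frac{1}{2} \alpha^\top K \alpha + C \sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad & 1 - y_i \sum_{j=1}^{n} \alpha_j k(x_j, x_i) \le \xi_i, \\
& \xi_i \ge 0, \quad \forall i.
\end{aligned} \tag{4.30}
\]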


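For the given primal problem in (4.30), the corresponding Lagrangian dual is given
by

\[
\begin{aligned}
\max_{\lambda, \gamma} \quad & g(\lambda, \gamma) \\
\text{subject to} \quad & \lambda \ge 0, \; \gamma \ge 0,
\end{aligned} \tag{4.31}
\]

where

\[
g(\lambda, \gamma) = \min_{\alpha, \xi} \Big\{ \frac{1}{2} \alpha^\top K \alpha + C \sum_{i=1}^{n} \xi_i
+ \sum_{i=1}^{n} \lambda_i \Big( 1 - y_i \sum_{j=1}^{n} \alpha_j k(x_j, x_i) - \xi_i \Big) - \sum_{i=1}^{n} \gamma_i \xi_i \Big\}, \tag{4.32}
\]

which can be further simplified as

\[
g(\lambda, \gamma) = \min_{\alpha, \xi} \Big\{ \frac{1}{2} \alpha^\top K \alpha + \sum_{i=1}^{n} \big( \lambda_i (1 - \xi_i) + (C - \gamma_i) \xi_i \big) - r^\top K \alpha \Big\}, \tag{4.33}
\]

where

\[
r = [y_1 \lambda_1 \; \cdots \; y_n \lambda_n]^\top.
\]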


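The first-order optimality conditions with respect to $\alpha$ and $\xi$ lead to the following
equations:

\[
K \alpha = K r \implies \alpha = r \tag{4.34}
\]

and

\[
\lambda_i + \gamma_i = C. \tag{4.35}
\]

By plugging (4.34) and (4.35) into Eq. (4.32), we have

\[
g(\lambda, \gamma) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j k(x_i, x_j),
\]

where $0 \le \lambda_i \le C$ and the classifier is given by

\[
f(x) = \sum_{j=1}^{n} y_j \lambda_j k(x_j, x), \tag{4.36}
\]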


which is equivalent to the kernel SVM we derived before. 
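
Because the formulation has no bias term, the dual is a concave quadratic maximization over the box $0 \le \lambda_i \le C$ with no equality constraint. The sketch below solves it by projected gradient ascent; this solver, the RBF kernel, the toy data, and the function names are our own illustrative choices, not a method prescribed by the text.

```python
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    """Gram matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def svm_dual_fit(K, y, C=10.0, n_iter=5000):
    """Maximize sum(lam) - 0.5 * lam^T Q lam subject to 0 <= lam <= C."""
    Q = (y[:, None] * y[None, :]) * K          # Q_ij = y_i y_j k(x_i, x_j)
    step = 1.0 / np.linalg.eigvalsh(Q).max()   # step size 1/L for the quadratic
    lam = np.zeros(len(y))
    for _ in range(n_iter):
        grad = 1.0 - Q @ lam                   # gradient of the dual objective
        lam = np.clip(lam + step * grad, 0.0, C)   # ascent step, then projection
    return lam

def svm_decision(K_new_train, y, lam):
    """f(x) = sum_j y_j lam_j k(x_j, x), as in (4.36)."""
    return K_new_train @ (y * lam)

# Toy two-class data (hypothetical values).
X = np.array([[-2.0, 0.0], [-1.0, 0.5], [1.0, -0.5], [2.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
C = 10.0

K = rbf_gram(X, X)
lam = svm_dual_fit(K, y, C=C)
scores = svm_decision(K, y, lam)
print(lam, np.sign(scores))
```

In practice a dedicated quadratic programming or SMO-type solver would be preferred; the loop above is only meant to make the structure of the dual visible.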


4.5 Pros and Cons of Kernel Machines 


The kernel machine has many important advantages that deserve further discussion.
This approach is based on the beautiful theory of the RKHS, which leads to
closed-form solutions in designing classifiers and regressors thanks to the representer
theorem. Therefore, the classical research issue is not the machine learning
algorithm itself, but rather finding the feature space embedding that can effectively
represent the data in the ambient space.

Having said this, there are several limitations associated with classical kernel
machines. First, what enables a closed-form solution via the representer theorem is
the assumption that the feature space forms an RKHS. This
implies that the mapping from the feature space to the final function is assumed to be
linear. This approach is somewhat unbalanced, given that only the mapping from the
ambient space to the feature space is nonlinear, whereas the feature space representation
is linear. Moreover, as discussed before, the RKHS is only a subset of the underlying
Hilbert space; therefore, restricting the feature space to the RKHS severely reduces
the available function class from the underlying Hilbert space (see Fig. 4.2). As such, it
limits the flexibility of the learning algorithm and the resulting expressiveness.

Finally, the feature mapping and the associated kernel in the classical machine
learning approach are primarily selected in a top-down manner based on human
intuition or mathematical modeling, which leaves no room for them to be learned
automatically from the data. In fact, the learning part of the kernel machine concerns
only the linear weighting parameters in the representer (i.e. the $\alpha_i$’s in (4.18)), whereas the
feature map itself is deterministic once the kernel is selected in a top-down manner.
This significantly limits the capability of learning. Later, we will investigate how
this limitation of the kernel machine can be mitigated by modern deep learning
approaches.


4.6 Exercises 


1. Show that the following kernels are positive definite.

   a. Cosine kernel: $k(x, y) = \cos(x - y)$ for $\forall x, y \in \mathbb{R}$.
   b. Polynomial kernel with degree exactly $p$: $k(x, y) = (x^\top y)^p$.
   c. Polynomial kernel with degree up to $p$: $k(x, y) = (x^\top y + 1)^p$.
   d. Radial basis function kernel with width $\sigma$: $k(x, y) = \exp(-\| x - y \|^2 / (2\sigma^2))$.
   e. Sigmoid kernel: $k(x, y) = \tanh(\eta\, x^\top y + \nu)$.


2. Let $k_1$ and $k_2$ be two positive definite kernels on a set $\mathcal{X}$, and $\alpha, \beta$ two positive
   scalars. Show that $\alpha k_1 + \beta k_2$ is positive definite.

3. Let $k_1$ be a positive definite kernel on a set $\mathcal{X}$. Then, for any polynomial $p(\cdot)$
   with non-negative coefficients, show that the following is also a positive definite
   kernel on a set $\mathcal{X}$:

   \[
   k(x, y) = p(k_1(x, y)), \quad x, y \in \mathcal{X}.
   \]

4. Let $\{\mathcal{X}_i\}_{i=1}^{p}$ be a sequence of sets and $k_i$ be a collection of corresponding
   positive definite functions on $\mathcal{X}_i$. Then, show that

   \[
   k(x_1, \cdots, x_p; y_1, \cdots, y_p) = k_1(x_1, y_1) \cdots k_p(x_p, y_p), \quad x_i, y_i \in \mathcal{X}_i, \; \forall i
   \]

   is a kernel on the space $\mathcal{X} := \mathcal{X}_1 \times \cdots \times \mathcal{X}_p$.


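5. Let $\mathcal{X}_0 \subset \mathcal{X}$. Show that the restriction of $k$ to $\mathcal{X}_0 \times \mathcal{X}_0$ is also a reproducing kernel.

6. Let $k$ be a valid kernel on $\mathcal{X}$. Is the following normalized function a valid
   positive definite kernel?

   \[
   k_{\mathrm{norm}}(x, y) =
   \begin{cases}
   0, & \text{if } k(x, x) = 0 \text{ or } k(y, y) = 0, \\
   \dfrac{k(x, y)}{\sqrt{k(x, x)\, k(y, y)}}, & \text{otherwise},
   \end{cases}
   \quad \forall x, y \in \mathcal{X}.
   \]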


7. Consider a normalized kernel $k$ such that $k(x, x) = 1$ for all $x \in \mathcal{X}$. Define a
   pseudo-metric on $\mathcal{X}$ as

   \[
   d_{\mathcal{X}}(x, y) = \| k(x, \cdot) - k(y, \cdot) \|_{\mathcal{H}}. \tag{4.37}
   \]

   a. Show that

      \[
      d_{\mathcal{X}}(x, y) = \sqrt{2 (1 - k(x, y))}.
      \]

   b. Show that $d_{\mathcal{X}}(x, y)$ is not a metric. Which property of the metric does it
      violate?


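8. Define the mean of the feature space

   \[
   \mu_\phi = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i).
   \]

   a. Show that

      \[
      \| \mu_\phi \|_{\mathcal{H}}^2 = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} k(x_i, x_j).
      \]

   b. Show that

      \[
      \sigma_\phi^2 := \frac{1}{n} \sum_{i=1}^{n} \| \phi(x_i) - \mu_\phi \|_{\mathcal{H}}^2 = \frac{1}{n} \mathrm{Tr}(K) - \| \mu_\phi \|_{\mathcal{H}}^2,
      \]

      where $\mathrm{Tr}(\cdot)$ denotes the matrix trace, and $K$ is the kernel Gram matrix

      \[
      K = \begin{bmatrix}
      k(x_1, x_1) & \cdots & k(x_1, x_n) \\
      \vdots & \ddots & \vdots \\
      k(x_n, x_1) & \cdots & k(x_n, x_n)
      \end{bmatrix}.
      \]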
9. The kernel SVM formulation in (4.27) is often called the 1-SVM. In this
   problem, we are interested in obtaining the 2-SVM, which is defined by

   \[
   \min_{f \in \mathcal{H}_k} \frac{1}{2} \| f \|_{\mathcal{H}}^2 + C \sum_{i=1}^{n} \ell_{\mathrm{hinge}}^2(y_i, f(x_i)),
   \]

   where $\ell_{\mathrm{hinge}}^2$ is the square hinge loss:

   \[
   \ell_{\mathrm{hinge}}^2(y, \hat{y}) = \left( \max\{0, 1 - y \hat{y}\} \right)^2.
   \]

   Write the primal and dual problems associated with the 2-SVM, and compare
   the result with the 1-SVM.
10. Consider the following kernel regression problem:

    \[
    \min_{f \in \mathcal{H}_k} \frac{1}{2} \| f \|_{\mathcal{H}}^2 + C \sum_{i=1}^{n} \ell_{\mathrm{logit}}(y_i, f(x_i)),
    \]

    where $\ell_{\mathrm{logit}}$ is the logistic regression loss:

    \[
    \ell_{\mathrm{logit}}(y, \hat{y}) = \log(1 + e^{-y \hat{y}}).
    \]

    Write the dual problems and find the solution as simply as possible.


Part II 
Building Blocks of Deep Learning 


“I get very excited when we discover a way of making neural networks better and
when that’s closely related to how the brain works.”


— Geoffrey Hinton 


Chapter 5
Biological Neural Networks


5.1 Introduction 


A biological neural network is composed of a group of connected neurons. A
single neuron may be connected to many other neurons, and the total number
of neurons and connections in a network can be very large. One of
the amazing aspects of biological neural networks is that when the neurons are
connected to each other, higher-level intelligence, which cannot be observed from
a single neuron, emerges. The exact mechanism of the emergence of intelligence
from the neuronal network has been an intense research topic for neuroscientists,
biologists, and engineers, and is not yet fully understood. In fact, computational
modeling and mathematical analysis of biological neural networks are integral parts
of the neuroscience discipline called computational neuroscience, which is also
closely related to the artificial neural network community. The main assumption
in this discipline is that computational modeling can unveil the probable working
mechanism of the biological network. Moreover, understanding the
working principles of biological neuronal networks is believed to open the
way to designing high-performance artificial neural networks.

Therefore, in this chapter, we will review the basic neurobiology regarding
individual neurons and their networks, and introduce some interesting
neuroscientific discoveries that have inspired artificial neural networks. However, these
introductory materials are by no means exhaustive, so interested readers are advised
to read standard textbooks in neuroscience [17-19].


© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
J.C. Ye, Geometry of Deep Learning, Mathematics in Industry 37,
https://doi.org/10.1007/978-981-16-6046-7_5




Fig. 5.1 Anatomy of neurons (labeled parts: cell body, nucleus, axon hillock, axon,
telodendria, synaptic terminals, endoplasmic reticulum, Golgi apparatus, mitochondrion)


5.2 Neurons 


5.2.1 Anatomy of Neurons 


A typical neuron consists of a cell body (soma), dendrites, and a single axon (see
Fig. 5.1). The axon and dendrites are filaments that extrude from the cell body.
Dendrites typically branch heavily and extend a few hundred micrometers from the
soma. The axon leaves the soma at the axon hillock and can extend up to 1 m in humans,
or more in other species. The end branches of an axon are called telodendria. At
the extreme tip of the axon’s branches are synaptic terminals, where the neuron can
transmit a signal to another cell via the synapse.

The endoplasmic reticulum (ER) in the soma performs many general functions,
including folding protein molecules and transporting synthesized proteins
in vesicles to the Golgi apparatus. Proteins synthesized in the ER are packaged
into vesicles, which then fuse with the Golgi apparatus. These cargo proteins are
modified in the Golgi apparatus and destined for secretion via exocytosis or for use
in the cell, as shown in Fig. 5.2.

Fig. 5.2 ER and Golgi apparatus for protein synthesis and transport


5.2.2 Signal Transmission Mechanism


Neurons specialize in forwarding signals to individual target cells via synapses. At
a synapse, the membrane of the presynaptic neuron comes into close proximity to
the membrane of the postsynaptic cell (see Fig. 5.3). Although there are electric
synapses where the presynaptic and postsynaptic neurons are directly fused together
for fast electric signal transmission [18, 19], chemical synapses, which transmit the
action potential via neurotransmitters, are the most common and are of great interest
for artificial neural networks.

As shown in Fig. 5.3, in a chemical synapse, electrical activity in the presynaptic
neuron is converted into the release of neurotransmitters that bind to receptors
located in the membrane of the postsynaptic cell. The neurotransmitters are usually
packaged in a synaptic vesicle, as shown in Fig. 5.3. Therefore, the amount of the
actual neurotransmitter at the postsynaptic terminal is an integer multiple of the
number of neurotransmitters in each vesicle, so this phenomenon is often referred to
as quantal release. The release is regulated by a voltage-dependent calcium channel.
The released neurotransmitter then binds to the receptors on the postsynaptic
dendrites, which can trigger an electrical response that can produce excitatory
postsynaptic potentials (EPSPs) or inhibitory postsynaptic potentials (IPSPs).

Fig. 5.3 Chemical synapse between presynaptic terminal and postsynaptic dendrite


The axon hillock (see Fig. 5.1) is a specialized part of the cell body that is
connected to the axon. Both IPSPs and EPSPs are summed in the axon hillock and
once a trigger threshold is exceeded, an action potential propagates through the rest
of the axon. This switching behavior of the axon hillock plays a very important role
in the information processing of neural networks, as will be discussed in detail later
in Chap. 6.


5.2.3 Synaptic Plasticity 


Synaptic plasticity is the ability of synapses to strengthen or weaken over time 
as their activity increases or decreases. In fact, synaptic plasticity is one of 
the important neurochemical foundations of learning and memory that is often 
mimicked by artificial neural networks. 

Two of the best studied forms of synaptic plasticity in the neuronal cell are
long-term potentiation (LTP) and long-term depression (LTD). Specifically, LTP
is a sustained strengthening of the synapses based on recent patterns of activity,
namely patterns of synaptic activity that cause a long-lasting increase in signal
transmission between two neurons. The opposite of LTP is LTD, which leads to a
long-lasting decrease in synaptic strength.

In contrast to the artificial neural network, in which synaptic plasticity
is usually modeled by simple weight changes, synaptic plasticity
in biological neurons often results from a change in the number of
neurotransmitter receptors located on a synapse. For example, as shown in Fig. 5.4,
during LTP additional receptors are fused to the membrane by exocytosis and
are then moved to the postsynaptic dendrite by lateral diffusion within the
membrane. On the other hand, in the case of LTD, some of the redundant receptors
are moved into the endocytosis region by lateral diffusion within the membrane, and
then absorbed by the cell via endocytosis.

Because of the dynamics of learning and synaptic plasticity, it becomes clear that 
the trafficking of these receptors is an important mechanism to meet the demand 
and supply of the receptors at various synaptic locations in the neurons. There are 
various mechanisms that are being intensively researched by neurobiologists. For 
example, assembled receptors leave the endoplasmic reticulum (ER) and reach the 
neural surface via the Golgi network. Packets of nascent receptors are transported 
along microtubule tracks from the cell body to synaptic sites through microtubule 
networks. Figure 5.5 shows critical steps in receptor assembly, transport, intracellu- 
lar trafficking, slow release and insertion at synapses. 


5.3 Biological Neural Network 


One of the most mysterious features of the brain is the emergence of higher- 
level information processing from the connections of neurons. To understand this 
emergent property, one of the most extensively studied biological neural networks 
is the visual system. Therefore, in this section we review the information processing 
in the visual system. 


5.3.1 Visual System 


The visual system is a part of the central nervous system that enables organisms to 
process visual detail as eyesight. It detects and interprets information from visible 
light to create a representation of the environment. The visual system performs a 
number of complex tasks, from capturing light to identifying and categorizing visual 
objects. 

As shown in Fig. 5.6, the reflected light from objects shines on the retina. The 
retina uses photoreceptors to convert this image into electrical impulses. The optic 
nerve then carries these impulses through the optic canal. Upon reaching the optic 
chiasm, the nerve fibers decussate (left becomes right). Most of the optic nerve 
fibers terminate in the lateral geniculate nucleus (LGN). The LGN forwards the 
impulses to V1 of the visual cortex. The LGN also sends some fibers to V2 and V3. 
V1 performs edge detection to understand spatial organization. V1 also creates a 
bottom-up saliency map to guide attention. 
Fig. 5.6 Anatomy of visual system and information processing


5.3.2 Hubel and Wiesel Model 


One of the most important discoveries of Hubel and Wiesel [20] is the hierarchical
visual information flow in the primary visual cortex. Specifically, by examining the
primary visual cortex of cats, Hubel and Wiesel found two classes of functional
cells: simple cells and complex cells. Simple cells at V1 L4 respond best to
edge-like stimuli with a certain orientation, position and phase within their
relatively small receptive fields (Fig. 5.7). They
realized that such a response of the simple cells could be obtained by pooling the
activity of a small set of input cells with the same receptive field that is observed in
LGN cells. They also observed that complex cells at V1 L2/L3, although also selective
for oriented bars and edges, tend to have larger receptive fields and have some
tolerance with regard to the exact position within their receptive fields. Hubel and
Wiesel found that position tolerance at the complex cell level could be obtained
by grouping simple cells at the level below with the same preferred orientation but
slightly different positions. As will be discussed later, the operation of pooling LGN
cells with the same receptive field is similar to the convolution operation, which
inspired Yann LeCun to invent the convolutional neural network for handwritten zip
code identification [21].

Fig. 5.8 Hierarchical models of visual information processing

The extension of these ideas from the primary visual cortex to higher areas
of the visual cortex led to a class of object recognition models, the feedforward
hierarchical models [22]. Specifically, as shown in Fig. 5.8, as we go from V1
to TE, the size of the receptive field increases and the latency of the response
increases. This implies that there is a neuronal connection along this path, which
forms a neuronal hierarchy. A more surprising finding is that as we go along this
pathway, neurons become sensitive to more complex inputs while becoming
insensitive to transformations of those inputs.


5.3.3 Jennifer Aniston Cell 


An extreme form or surprising example of this information processing hierarchy 
can be found in the discovery of the so-called “Jennifer Aniston Cell” [23], which 
represents a complex but specific concept or object. For those who do not know 
Jennifer Aniston, she was one of the most popular American actresses of the 1990s, 
having starred in America’s favorite sitcom, Friends. 
Fig. 5.9 The anatomical location of the medial temporal lobe


The study involved eight epilepsy patients who were temporarily implanted with 
a single cell recording device to monitor the activity of brain cells in the medial 
temporal lobe (MTL). The medial temporal lobe contains a system of anatomically 
related structures that are essential for declarative memory (conscious memory for 
facts and events). The system consists of the hippocampal region (Cornu Ammonis 
(CA) fields, dentate gyrus, and subicular complex) and the adjacent perirhinal, 
entorhinal, and parahippocampal cortices (see Fig. 5.9). 

During the single cell recording, the authors in [23] noticed a strange pattern in
the medial temporal lobe (MTL) of one of their participants. Every time
the patient saw a picture of Jennifer Aniston, a specific neuron in the brain fired.
They tried showing the words “Jennifer Aniston,” and again it fired. They tried
other ways to evoke Jennifer Aniston, and each time it fired. The
conclusion was inevitable: for this particular person, there was a single neuron that
embodied the concept of Jennifer Aniston.

The experiment showed that individual neurons in the MTL respond to the 
faces of certain people. The researchers say that these types of cell are involved 
in sophisticated aspects of visual processing, such as identifying a person, rather 
than just a simple shape. This observation leads to a fundamental question: can a 
single neuron embody a single concept? Although this issue will be investigated 
thoroughly throughout the book, the short answer is “no” because it is not the single 
neuron in isolation, but a neuron from a densely connected neural network that can 
extract the high-level concept. 
5.4 Exercises


1. Explain the role of the following structure in a neuron:

   a. Soma
   b. Dendrite
   c. ER
   d. Golgi apparatus
   e. Axon hillock
   f. Synapse

2. It is important to have a sense for the relative orders of magnitude of cellular
   components. Please specify each physical parameter for a synapse.

   a. Vesicle diameter
   b. Synapse width
   c. Vesicles released per active zone per action potential
   d. Synaptic cleft width

3. Explain the differences between electrical and chemical synapses.
4. Explain the different types of neurotransmitters and their roles.
5. Explain the differences between ionotropic receptors and metabotropic receptors.
6. Explain the mechanism of LTD and LTP.
7. What is the role of the neurotransmitter trafficking?
8. Explain the visual information processing step by step.
9. Explain why the Hubel and Wiesel model implies the convolutional processing
   in the visual cortex.
10. What is the main observation from the Jennifer Aniston cell?


Chapter 6
Artificial Neural Networks
and Backpropagation


6.1 Introduction 


Inspired by the biological neural network, here we discuss its mathematical
abstraction known as the artificial neural network (ANN). Although efforts have
been made to model all aspects of the biological neuron using a mathematical model,
not all of them are necessary: rather, there are some key aspects that should not
be neglected when modeling a neuron. These include weight adaptation and
nonlinearity. In fact, without them, we cannot expect any learning behavior.

In this chapter, we first describe a mathematical model for a single neuron, 
and explain its multilayer realization using a feedforward neural network. We then 
discuss standard methods of updating weight, often referred to as neural network 
training. One of the most important parts of neural network training is gradient 
computation, so the rest of this chapter discusses the main weight update techniques 
known as backpropagation in detail. 


6.2 Artificial Neural Networks 


6.2.1 Notation 


Since the mathematical description of an artificial neural network involves several 
indices for neuron, layers, training sample, etc., here we would like to summarize 
them for reference so that they can be used in the rest of the chapter. 

First, each training sample is usually represented by bold face lower case letters
with the index $n$: for example, the following are used to indicate the $n$-th
training-data-related variables:

\[
\boldsymbol{x}_n, \quad \boldsymbol{y}_n, \quad \{\boldsymbol{x}_n, \boldsymbol{y}_n\}_{n=1}^{N}, \quad \boldsymbol{o}_n, \quad \boldsymbol{g}_n.
\]

Second, with a slight abuse of notation, the subscripts $i$ and $j$ on light face
lower-case letters denote the $i$-th and $j$-th elements of a vector: for example, $o_i$ is the $i$-th
element of the vector $\boldsymbol{o} \in \mathbb{R}^d$:

\[
o_i = [\boldsymbol{o}]_i, \quad \text{or} \quad \boldsymbol{o} = [o_1 \; \cdots \; o_d]^\top.
\]

Similarly, the double index $ij$ indicates the $(i, j)$ element of a matrix: for example,
$w_{ij}$ is the $(i, j)$-th element of a matrix $\boldsymbol{W} \in \mathbb{R}^{p \times q}$:

\[
w_{ij} = [\boldsymbol{W}]_{ij}, \quad \text{or} \quad
\boldsymbol{W} = \begin{bmatrix}
w_{11} & \cdots & w_{1q} \\
\vdots & \ddots & \vdots \\
w_{p1} & \cdots & w_{pq}
\end{bmatrix}.
\]

This index notation is often used to refer to the $i$-th or $j$-th neuron in each layer of
a neural network. To avoid potential confusion, the $i$-th element of the
$n$-th training data vector $\boldsymbol{x}_n$ is written as $(\boldsymbol{x}_n)_i$. Next, to denote the $l$-th layer, the
following superscript notation is used:

\[
\boldsymbol{g}^{(l)}, \quad \boldsymbol{W}^{(l)}, \quad \boldsymbol{b}^{(l)}.
\]

Accordingly, by combining the training index $n$, for example, $\boldsymbol{g}_n^{(l)}$ refers to the $l$-th
layer $\boldsymbol{g}$ vector for the $n$-th training data. Finally, the $t$-th update using an optimizer
such as the stochastic gradient method can be denoted by $[t]$: for example,

\[
\boldsymbol{\Theta}[t], \quad \boldsymbol{V}[t]
\]

refer to the $t$-th update of the parameter map $\boldsymbol{\Theta}$ and $\boldsymbol{V}$, respectively.


6.2.2. Modeling a Single Neuron 


Consider a typical biological neuron in Fig. 6.1 and its mathematical diagram in 
Fig. 6.2. Let $o_j$, $j = 1, \cdots, d$, denote the presynaptic potential from the j-th 
dendritic synapse. For mathematical simplicity, we assume that the potentials occur 
synchronously and arrive simultaneously at the axon hillock. At the axon hillock, 
they are summed together, and the neuron fires an action potential if the summed 
signal is greater than a specific threshold value. This process can be mathematically 
modeled as 

$$\mathrm{net}_i = \sigma\left(\sum_{j=1}^{d} w_{ij} o_j + b_i\right), \tag{6.1}$$

where $\mathrm{net}_i$ denotes the action potential arriving at the i-th synaptic terminal 
of the telodendria, and $b_i$ is the bias term for the nonlinearity $\sigma(\cdot)$ at the 
axon hillock. Note that $w_{ij}$ is the weight parameter determined by the synaptic 
plasticity: positive weights imply that $w_{ij} o_j$ are excitatory postsynaptic potentials 
(EPSPs), whereas negative weights correspond to inhibitory postsynaptic potentials 
(IPSPs). 

Fig. 6.2 A mathematical model of a single neuron 

In artificial neural networks (ANNs), the nonlinearity $\sigma(\cdot)$ in (6.1) is modeled 
in various ways as shown in Fig. 6.3. This nonlinearity is often called the activation 
function. Nonlinearity is perhaps the most important feature of neural 
networks, since learning and adaptation never happen without nonlinearity. The 
mathematical proof of this argument is somewhat complicated, so the discussion 
will be deferred to later. 

Fig. 6.3 Various forms of activation functions 


Among the various forms of the activation functions, one of the most successful 
ones in modern deep learning is the rectified linear unit (ReLU), which is defined as 
[24] 

$$\sigma(x) = \mathrm{ReLU}(x) := \max\{0, x\}. \tag{6.2}$$

The ReLU activation function is called active when the output is nonzero. It is 
believed that the non-vanishing gradient in the positive range contributed to the 
success of modern deep learning. Specifically, we have 

$$\frac{\partial \mathrm{ReLU}(x)}{\partial x} = \begin{cases} 1, & \text{if } x > 0 \\ 0, & \text{otherwise} \end{cases}, \tag{6.3}$$

which shows that the gradient is always 1 whenever the ReLU is active. Note that 
we set the gradient to 0 at x = 0 by convention, since the ReLU is not differentiable at 
x = 0. 

In evaluating the activation function $\sigma(x)$, the gain function, which refers to the 
input/output ratio, is also useful: 

$$\gamma(x) := \frac{\sigma(x)}{x}, \quad x \neq 0. \tag{6.4}$$

For example, the ReLU satisfies the following important property: 

$$\gamma(x) = \frac{\partial \sigma(x)}{\partial x} = \begin{cases} 1, & \text{if } x > 0 \\ 0, & \text{otherwise} \end{cases}, \tag{6.5}$$

which will be used later in analyzing the backpropagation algorithm. 
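The definitions in (6.1) to (6.5) can be checked numerically. The following is a minimal sketch in plain Python; the function names and the sample values are illustrative choices, not part of the text.

```python
def relu(x):
    """ReLU activation in (6.2): max{0, x}."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative in (6.3); the value at x = 0 is set to 0 by convention."""
    return 1.0 if x > 0 else 0.0


def gain(sigma, x):
    """Gain function in (6.4): sigma(x) / x, defined only for x != 0."""
    if x == 0:
        raise ValueError("the gain is undefined at x = 0")
    return sigma(x) / x


def neuron(weights, inputs, bias, sigma=relu):
    """Single-neuron model in (6.1): sigma(sum_j w_j * o_j + b)."""
    return sigma(sum(w * o for w, o in zip(weights, inputs)) + bias)


samples = [-2.0, -0.5, 0.5, 3.0]           # illustrative inputs, all nonzero
gains = [gain(relu, x) for x in samples]
grads = [relu_grad(x) for x in samples]
print(gains == grads)                      # property (6.5) on these samples

# Hypothetical weights: one excitatory (positive) and one inhibitory (negative).
print(neuron([0.5, -1.0], [2.0, 1.0], 0.25))
```

For the ReLU the gain and the derivative coincide at every nonzero input, which is exactly the property (6.5) used later in the variational analysis.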

There is an additional advantage of using the ReLU compared to other nonlineari- 
ties. As will be explained in detail later, the ReLU divides the input and feature space 
into two disjoint sets, i.e. active and inactive areas, resulting in a piecewise linear 
approximation of a nonlinear mapping onto the partitioned geometry. Accordingly, 
a neural network within each partition can be viewed as locally linear, even though 
the overall map is highly nonlinear. This is the geometric picture of a deep neural 
network that we would like to highlight for readers in this book. 


6.2.3 Feedforward Multilayer ANN 


Biological neural networks are composed of multiple neurons that are connected 
to each other. This connection can have complicated topology, such as recurrent 
connection, asynchronous connection, inter-neurons, etc. 

One of the simplest forms of neural network connection is the multilayer 
feedforward neural network shown in Figs. 6.4 and 6.8. Specifically, let $o_j^{(l-1)}$ 
denote the j-th output of the (l − 1)-th layer neuron, which is given as the j-th 
dendrite presynaptic potential input for the l-th layer neuron, and let $w_{ij}^{(l)}$ 
correspond to the synaptic weights at the l-th layer. Then, by extending the model in 
(6.1) we have 

$$o_i^{(l)} = \sigma\left(\sum_{j=1}^{d^{(l-1)}} w_{ij}^{(l)} o_j^{(l-1)} + b_i^{(l)}\right), \tag{6.6}$$

for $i = 1, \cdots, d^{(l)}$, where $d^{(l)}$ denotes the number of neurons in the l-th layer, 
so that each l-th layer neuron has $d^{(l-1)}$ dendrites (inputs). This can be represented 
in matrix form as 

$$\mathbf{o}^{(l)} = \boldsymbol{\sigma}\left(\mathbf{W}^{(l)} \mathbf{o}^{(l-1)} + \mathbf{b}^{(l)}\right), \tag{6.7}$$

where $\mathbf{W}^{(l)} \in \mathbb{R}^{d^{(l)} \times d^{(l-1)}}$ is the weight matrix whose (i, j) 
elements are given by $w_{ij}^{(l)}$, $\boldsymbol{\sigma}(\cdot)$ denotes the nonlinearity 
$\sigma(\cdot)$ applied to each element of the vector, and 

$$\mathbf{o}^{(l)} = \left[o_1^{(l)} \ \cdots \ o_{d^{(l)}}^{(l)}\right]^\top \in \mathbb{R}^{d^{(l)}}, \tag{6.8}$$

$$\mathbf{b}^{(l)} = \left[b_1^{(l)} \ \cdots \ b_{d^{(l)}}^{(l)}\right]^\top \in \mathbb{R}^{d^{(l)}}. \tag{6.9}$$

Fig. 6.4 Examples of multilayer feedforward neural networks 


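Another way to simplify the multilayer representation is using the hidden nodes 
from linear layers in between. Specifically, an L-layer feedforward neural network 
can be represented recursively using the hidden node $\mathbf{g}^{(l)}$ by 

$$\mathbf{o}^{(l)} = \boldsymbol{\sigma}(\mathbf{g}^{(l)}), \quad \mathbf{g}^{(l)} = \mathbf{W}^{(l)} \mathbf{o}^{(l-1)} + \mathbf{b}^{(l)}, \tag{6.10}$$

for $l = 1, \cdots, L$. 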
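The recursion in (6.10) translates directly into a loop over layers. Below is a minimal sketch in plain Python with ReLU activations; the 2-3-2 network, its weights, and the helper names `affine` and `forward` are hypothetical values chosen only so that the result can be verified by hand.

```python
def relu_vec(g):
    """Apply the ReLU to each element of a vector."""
    return [max(0.0, v) for v in g]


def affine(W, o, b):
    """Hidden node g = W o + b, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, o)) + bi for row, bi in zip(W, b)]


def forward(weights, biases, x):
    """Return ([o^(0), ..., o^(L)], [g^(1), ..., g^(L)]) following (6.10)."""
    outputs, hidden = [list(x)], []
    for W, b in zip(weights, biases):
        g = affine(W, outputs[-1], b)
        hidden.append(g)
        outputs.append(relu_vec(g))
    return outputs, hidden


# Hypothetical 2-3-2 network.
weights = [
    [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]],   # W^(1) in R^{3x2}
    [[1.0, 0.0, -1.0], [0.0, 2.0, 1.0]],      # W^(2) in R^{2x3}
]
biases = [[0.0, 0.0, 1.0], [0.5, -1.0]]

outputs, hidden = forward(weights, biases, [1.0, 2.0])
print(hidden[0])      # g^(1)
print(outputs[-1])    # o^(2), the network output
```

Keeping both the hidden nodes and the layer outputs is a deliberate choice: backpropagation later needs the outputs as forward-propagated inputs and the hidden nodes to decide which ReLUs are active.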


6.3 Artificial Neural Network Training 


6.3.1 Problem Formulation 


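For given training data $\{\mathbf{x}_n, \mathbf{y}_n\}_{n=1}^N$, a neural network training 
problem can then be formulated as follows: 

$$\hat{\boldsymbol{\Theta}} = \arg\min_{\boldsymbol{\Theta}} c(\boldsymbol{\Theta}), \tag{6.11}$$

where the cost function is given by 

$$c(\boldsymbol{\Theta}) := \sum_{n=1}^{N} \ell\left(\mathbf{y}_n, f_{\boldsymbol{\Theta}}(\mathbf{x}_n)\right). \tag{6.12}$$

Here, $\ell(\cdot, \cdot)$ denotes a loss function, and $f_{\boldsymbol{\Theta}}(\mathbf{x}_n)$ is a 
regression function with the input $\mathbf{x}_n$, which is parameterized by the parameter 
set $\boldsymbol{\Theta}$. 

For the case of an L-layer feedforward neural network, the regression function 
$f_{\boldsymbol{\Theta}}(\mathbf{x}_n)$ in (6.12) can be represented by 

$$f_{\boldsymbol{\Theta}}(\mathbf{x}_n) := \left(\boldsymbol{\sigma} \circ \mathbf{g}^{(L)} \circ \boldsymbol{\sigma} \circ \mathbf{g}^{(L-1)} \circ \cdots \circ \mathbf{g}^{(1)}\right)(\mathbf{x}_n), \tag{6.13}$$

where the parameter set $\boldsymbol{\Theta}$ is composed of the synaptic weight and bias for 
each layer: 

$$\boldsymbol{\Theta} = \begin{bmatrix} \mathbf{W}^{(1)}, & \mathbf{b}^{(1)} \\ \vdots & \vdots \\ \mathbf{W}^{(L)}, & \mathbf{b}^{(L)} \end{bmatrix}. \tag{6.14}$$

Fig. 6.5 Examples of cost functions for a 1-D optimization problem: (a) both local and global 
minimizers exist, (b) only a single global minimizer exists, (c) multiple global minimizers exist 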


As discussed before for kernel machines in Chap. 4, the formulation in (6.11) 
is so general that it covers classification, regression, etc., by simply changing the 
loss function (for example, the $\ell_2$ loss for regression, and the hinge loss for 
classification). Unfortunately, in contrast to the kernel machines, one of the main 
difficulties in neural network training is that the cost function $c(\boldsymbol{\Theta})$ is not 
convex, and indeed there exist many local minimizers (see Fig. 6.5). Therefore, 
neural network training critically depends on the choice of optimization algorithm, 
initialization, step size, etc. 


6.3.2 Optimizers 


In view of the parameterized neural network in (6.13), the key question is how the 
minimizers for the optimization problem (6.11) can be found. As already mentioned, 
the main technical challenge of this minimization problem is that there are many 
local minimizers, as shown in Fig.6.5a. Another tricky issue is that sometimes 
there are many global minimizers, as shown in Fig. 6.5c. Although all the global 
minimizers can be equally good in the training phase, each global minimizer 
may have different generalization performance in the test phase. This issue is 
important and will be discussed later. Furthermore, different global minimizers can 
be achieved depending on the specific choice of an optimizer, which is often called 
the implicit bias or inductive bias of an optimization algorithm. This topic will also 
be discussed later. 

One of the most important observations in designing optimization algorithms is 
that the following first-order necessary condition (FONC) holds at local minimizers. 


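Lemma 6.1 Let $c: \mathbb{R}^p \mapsto \mathbb{R}$ be a differentiable function. If 
$\boldsymbol{\Theta}^*$ is a local minimizer, then 

$$\left.\frac{\partial c}{\partial \boldsymbol{\Theta}}\right|_{\boldsymbol{\Theta} = \boldsymbol{\Theta}^*} = \mathbf{0}. \tag{6.15}$$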

Indeed, various optimization algorithms exploit the FONC, and the main dif- 
ference between them is the way they avoid the local minimum and provide fast 
convergence. In the following, we start with the discussion of the classical gradient 
descent method and its stochastic extension called the stochastic gradient descent 
(SGD), after which various improvements will be discussed. 


6.3.2.1 Gradient Descent 


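For the given training data $\{\mathbf{x}_n, \mathbf{y}_n\}_{n=1}^N$, the gradient of the cost 
function in (6.12) is given by 

$$\frac{\partial c}{\partial \boldsymbol{\Theta}}(\boldsymbol{\Theta}) = \frac{\partial \left(\sum_{n=1}^{N} \ell\left(\mathbf{y}_n, f_{\boldsymbol{\Theta}}(\mathbf{x}_n)\right)\right)}{\partial \boldsymbol{\Theta}} = \sum_{n=1}^{N} \frac{\partial \ell}{\partial \boldsymbol{\Theta}}\left(\mathbf{y}_n, f_{\boldsymbol{\Theta}}(\mathbf{x}_n)\right), \tag{6.16}$$

which is equal to the sum of the gradients at each of the training data. Since the 
gradient is the steepest ascent direction of the cost function, the steepest descent 
algorithm updates the parameter in the opposite direction: 

$$\boldsymbol{\Theta}[t+1] = \boldsymbol{\Theta}[t] - \eta \left.\frac{\partial c}{\partial \boldsymbol{\Theta}}(\boldsymbol{\Theta})\right|_{\boldsymbol{\Theta} = \boldsymbol{\Theta}[t]} = \boldsymbol{\Theta}[t] - \eta \sum_{n=1}^{N} \left.\frac{\partial \ell}{\partial \boldsymbol{\Theta}}\left(\mathbf{y}_n, f_{\boldsymbol{\Theta}}(\mathbf{x}_n)\right)\right|_{\boldsymbol{\Theta} = \boldsymbol{\Theta}[t]}, \tag{6.17}$$

where $\eta > 0$ denotes the step size and $\boldsymbol{\Theta}[t]$ is the t-th update of the 
parameter $\boldsymbol{\Theta}$. Figure 6.6a illustrates why gradient descent is a good way to 
minimize the cost for the convex optimization problem. As the gradient of the cost points 
toward the uphill direction of the cost, the parameter update should be in its negative 
direction. After a small step, a new gradient is computed and a new search direction is 
found. By iterating the procedure with a suitably small step size, we can reach the 
global minimum in the convex case. 

Fig. 6.6 Steepest gradient descent example: (a) convex case, where steepest descent succeeds, 
(b) non-convex case, where the steepest descent cannot go uphill, (c) steepest descent leads to 
different local minimizers depending on the initialization 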

One of the downsides of the gradient descent method is that when the gradient 
becomes zero at a local minimizer at iteration $t^*$, the update equation in (6.17) makes 
the iteration get stuck at the local minimizer, i.e.: 

$$\boldsymbol{\Theta}[t+1] = \boldsymbol{\Theta}[t], \quad t \geq t^*. \tag{6.18}$$


For example, Fig. 6.6b,c show the potential limitations of gradient descent. For 
the case of Fig. 6.6b, the path toward the global minimum contains uphill 
directions, which cannot be overcome by gradient methods. On the other hand, 
Fig. 6.6c shows that depending on the initialization, different local minimizers can 
be found by the gradient descent due to the different intermediate path. In fact, the 
situations in Fig. 6.6b,c are more likely situations in neural network training, since 
the optimization problem is highly non-convex due to the cascaded connection of 
nonlinearities. In addition, despite using the same initialization, the optimizer can 
converge to a completely different solution depending on the step size or certain 
optimization algorithms. In fact, algorithmic bias is a major research topic in modern 
deep learning, often referred to as inductive bias. 
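The behavior described above can be reproduced on one-dimensional toy costs. The sketch below iterates (6.17) for a scalar parameter; the two cost functions, step sizes, and initial values are illustrative assumptions rather than examples taken from the text.

```python
def gradient_descent(grad, theta0, eta, steps):
    """Iterate the update (6.17) for a scalar parameter."""
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta


def convex_grad(theta):
    """Gradient of the convex cost c(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


def nonconvex_grad(theta):
    """Gradient of c(theta) = (theta^2 - 1)^2, which has minimizers at -1 and +1."""
    return 4.0 * theta * (theta * theta - 1.0)


# Convex case: both initializations reach the same global minimizer.
a = gradient_descent(convex_grad, -5.0, eta=0.1, steps=200)
b = gradient_descent(convex_grad, 10.0, eta=0.1, steps=200)

# Non-convex case: the initialization decides which minimizer is found.
left = gradient_descent(nonconvex_grad, -0.5, eta=0.05, steps=200)
right = gradient_descent(nonconvex_grad, 0.5, eta=0.05, steps=200)

# Zero gradient: the iterate never moves, as in (6.18). Here theta = 0 is a
# stationary point (a local maximizer of this cost), not a minimizer.
stuck = gradient_descent(nonconvex_grad, 0.0, eta=0.05, steps=200)
print(a, b, left, right, stuck)
```

The last run shows that the update rule only sees the gradient: once it vanishes, the iteration stops regardless of whether the point is a good solution.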

This can be another reason why neural network training is difficult and depends 
heavily on who is training the model. For example, even if multiple students are 
given the exact same training set, network architecture, GPU, etc., it is usually 
observed that some students are successfully training the neural network and others 
are not. The main reason for such a difference is usually due to their commitment 
and self-confidence, which leads to different optimization algorithms with different 
inductive biases. Successful students usually try different initializations, optimizers, 
different learning rates, etc. until the model works, while unsuccessful students 
usually stick to the parameters all the time without trying to carefully change them. 
Instead, they often claim that the failure is not their fault, but because of the wrong 
model they started with. If the training problem were convex, then regardless of the 
inductive bias they have in training, all students could be successful. Unfortunately, 
neural network training is highly non-convex, so it is highly dependent on the 
student’s inductive bias. The good news is that once students learn how to make a 
model work, the intuition they gain from such experiences usually works for training 
more complicated neural networks. 

Indeed, advances in algorithms to optimize deep neural networks can be viewed 
as overcoming operator dependency. The following describes the various methods 
of systematically reducing the operator-dependent inductive bias for training neural 
networks, although the same problem still exists, albeit in a reduced manner, due to 
the non-convexity of the problems. 


6.3.2.2 Stochastic Gradient Descent (SGD) Method 


We say that the update equations in (6.17) are based on full gradients, since at 
each iteration we need to compute the gradient with respect to the whole data set. 
However, if N is large, the computational cost of the gradient calculation is quite heavy. 
Moreover, by using the full gradient, it is difficult to avoid the local minimizer, since 
the gradient descent direction is always toward the lower cost value. 

To address the problem, the SGD algorithm uses an easily computable estimate 
of the gradient using a small subset of training data. Although it is a bit noisy, this 
noisy gradient can even be helpful in avoiding local minimizers. For example, let 
$I[t] \subset \{1, \cdots, N\}$ denote a random subset of the index set $\{1, \cdots, N\}$ at 
the t-th update. Then, our estimate of the full gradient at the t-th iteration is given by 

$$\left.\frac{\partial c}{\partial \boldsymbol{\Theta}}(\boldsymbol{\Theta})\right|_{\boldsymbol{\Theta} = \boldsymbol{\Theta}[t]} \simeq \frac{N}{|I[t]|} \sum_{n \in I[t]} \left.\frac{\partial \ell}{\partial \boldsymbol{\Theta}}\left(\mathbf{y}_n, f_{\boldsymbol{\Theta}}(\mathbf{x}_n)\right)\right|_{\boldsymbol{\Theta} = \boldsymbol{\Theta}[t]}, \tag{6.19}$$

where $|I[t]|$ denotes the number of elements in $I[t]$. As the SGD utilizes a small 
random subset of the original training data set (i.e. $|I[t]| \ll N$) in calculating the 
gradient, the computational complexity for each update is much smaller than that of the 
original gradient descent method. Moreover, the estimate is not exactly the same as the 
true gradient direction, so the resulting noise can provide a means to escape from 
local minimizers. 
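A small sketch can make the estimate in (6.19) concrete. The example below assumes a hypothetical one-parameter least-squares problem with noiseless targets; it checks that the rescaled mini-batch gradient averages to the full gradient over all mini-batches of a given size, and then runs SGD with a seeded random generator. The data, step size, and batch size are illustrative assumptions.

```python
import itertools
import random

xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [2.0 * x for x in xs]     # noiseless targets, so theta = 2 fits every sample
N = len(xs)


def sample_grad(theta, n):
    """Gradient of the per-sample loss 0.5 * (y_n - theta * x_n)^2 w.r.t. theta."""
    return -(ys[n] - theta * xs[n]) * xs[n]


def full_grad(theta):
    """Full gradient (6.16): sum over all training samples."""
    return sum(sample_grad(theta, n) for n in range(N))


def minibatch_grad(theta, batch):
    """Estimate (6.19): rescale the mini-batch sum by N / |I[t]|."""
    return N / len(batch) * sum(sample_grad(theta, n) for n in batch)


def sgd(theta, eta, steps, batch_size, rng):
    """Run SGD, drawing a fresh random index subset I[t] at every update."""
    for _ in range(steps):
        batch = rng.sample(range(N), batch_size)
        theta -= eta * minibatch_grad(theta, batch)
    return theta


# Averaged over every possible mini-batch of size 2, the estimate equals the full gradient.
all_batches = list(itertools.combinations(range(N), 2))
mean_estimate = sum(minibatch_grad(0.0, b) for b in all_batches) / len(all_batches)
print(mean_estimate, full_grad(0.0))

theta_hat = sgd(0.0, eta=0.01, steps=500, batch_size=2, rng=random.Random(0))
print(theta_hat)
```

Because the targets here are noiseless, every per-sample gradient vanishes at the minimizer, so the iterates settle there despite the sampling noise; with noisy data a constant step size would leave a residual fluctuation.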


6.3.2.3 Momentum Method 


Another way to overcome the local minimum is to take into account the previous 
updates as additional terms to avoid getting stuck in local minima. Specifically, a 
desirable update equation may be written as 

$$\boldsymbol{\Theta}[t+1] = \boldsymbol{\Theta}[t] - \eta \sum_{s=1}^{t} \beta^{t-s} \frac{\partial c}{\partial \boldsymbol{\Theta}}(\boldsymbol{\Theta}[s]) \tag{6.20}$$

for an appropriate forgetting factor $0 < \beta < 1$. This implies that the contribution 
from the past gradients is gradually reduced in calculating the current update 
direction. However, the main limitation of using (6.20) is that the whole history of 
the past gradients must be saved, which requires huge GPU memory. Instead, 
the following recursive formulation, which provides an equivalent representation, 
is mostly used: 

$$\mathbf{V}[t] = \beta\left(\boldsymbol{\Theta}[t] - \boldsymbol{\Theta}[t-1]\right) - \eta \frac{\partial c}{\partial \boldsymbol{\Theta}}(\boldsymbol{\Theta}[t]),$$

$$\boldsymbol{\Theta}[t+1] = \boldsymbol{\Theta}[t] + \mathbf{V}[t]. \tag{6.21}$$

Fig. 6.7 Example trajectory of update in (a) stochastic gradient, (b) SGD with momentum method 


This type of method is called the momentum method, and is particularly useful 
when it is combined with the SGD. The example update trajectory of the SGD with 
momentum is shown in Fig. 6.7b. Compared to the fluctuating path, the momentum 
method provides a smoothed solution path thanks to the averaging effects from the 
past gradient, which results in fast convergence. 
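The equivalence between the explicit sum in (6.20) and the recursion in (6.21) can be checked numerically. The sketch below assumes a scalar quadratic cost and no initial velocity (the first two iterates coincide); the cost, step size, and forgetting factor are illustrative choices.

```python
def grad(theta):
    """Gradient of the illustrative cost c(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


def momentum_recursive(theta0, eta, beta, steps):
    """Recursion (6.21): V[t] = beta (Theta[t] - Theta[t-1]) - eta grad(Theta[t])."""
    prev, theta = theta0, theta0        # assumed start: no initial velocity
    path = [theta]
    for _ in range(steps):
        v = beta * (theta - prev) - eta * grad(theta)
        prev, theta = theta, theta + v
        path.append(theta)
    return path


def momentum_explicit(theta0, eta, beta, steps):
    """Explicit form (6.20): weighted sum over the whole gradient history."""
    grads, theta = [], theta0
    path = [theta]
    for _ in range(steps):
        grads.append(grad(theta))       # the full history must be stored
        t = len(grads)
        step = sum(beta ** (t - s) * g for s, g in enumerate(grads, start=1))
        theta = theta - eta * step
        path.append(theta)
    return path


recursive_path = momentum_recursive(0.0, eta=0.1, beta=0.5, steps=100)
explicit_path = momentum_explicit(0.0, eta=0.1, beta=0.5, steps=100)
max_gap = max(abs(r - e) for r, e in zip(recursive_path, explicit_path))
print(max_gap, recursive_path[-1])
```

The recursive version stores only the previous iterate, whereas the explicit version keeps every past gradient, which is the memory argument made in the text.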


6.3.2.4 Other Variations 


In neural networks, several other variants of the optimizers are often used, among 
which AdaGrad [25], RMSprop [26], and Adam [27] are the most popular. The main 
idea of these variants is that instead of using a fixed step size $\eta$ for all elements of 
the gradient, an element-wise adaptive step size is used. For example, for the case 
of the steepest descent in (6.17), we use the following update equation: 

$$\boldsymbol{\Theta}[t+1] = \boldsymbol{\Theta}[t] - \boldsymbol{\Upsilon}[t] \odot \frac{\partial c}{\partial \boldsymbol{\Theta}}(\boldsymbol{\Theta}[t]), \tag{6.22}$$

where $\boldsymbol{\Upsilon}[t]$ is a matrix of step sizes and $\odot$ denotes element-wise 
multiplication. In fact, the main difference between these algorithms is how the matrix 
$\boldsymbol{\Upsilon}[t]$ is updated at each iteration. For more details on specific update 
rules, see the original papers [25-27]. 
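As one concrete instance of (6.22), the sketch below follows the AdaGrad idea of dividing a base step size by the square root of the accumulated squared gradients, element by element. It is a simplified illustration under assumed constants (`eta`, `eps`) and an assumed badly scaled quadratic cost, not a reproduction of the exact rules in [25-27].

```python
import math


def adagrad(grad, theta, eta=0.5, eps=1e-8, steps=500):
    """Element-wise adaptive step in the spirit of AdaGrad (simplified sketch)."""
    accum = [0.0] * len(theta)                   # running sum of squared gradients
    for _ in range(steps):
        g = grad(theta)
        accum = [a + gi * gi for a, gi in zip(accum, g)]
        # Upsilon[t] in (6.22): one step size per parameter element.
        step_sizes = [eta / (math.sqrt(a) + eps) for a in accum]
        theta = [th - s * gi for th, s, gi in zip(theta, step_sizes, g)]
    return theta


def scaled_quadratic_grad(theta):
    """Gradient of c(theta) = 0.5 * (100 * theta_0^2 + theta_1^2)."""
    return [100.0 * theta[0], theta[1]]


after_three = adagrad(scaled_quadratic_grad, [1.0, 1.0], steps=3)
final = adagrad(scaled_quadratic_grad, [1.0, 1.0])
print(after_three, final)
```

Although the two coordinates have curvatures that differ by a factor of 100, they follow nearly identical trajectories, because the per-element normalization cancels the scale of each gradient component.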
6.4 The Backpropagation Algorithm 


In the previous section, various optimization algorithms for neural network training 
were discussed based on the assumption that the gradient 
$\frac{\partial c}{\partial \boldsymbol{\Theta}}(\boldsymbol{\Theta}[t])$ is computed. 
However, given the complicated nonlinear nature of the feedforward neural network, 
the computation of the gradient is not trivial. 

In machine learning, backpropagation (backprop, or BP) [28] is a standard way 
of calculating the gradient in training feedforward neural networks, by providing 
an explicit and computationally efficient way of computing the gradient. The term 
backpropagation and its general use in neural networks were introduced in 
Rumelhart, Hinton and Williams [28]. Their main idea is that although the 
multilayer neural network is composed of complicated connections of neurons with a 
large number of unknown weights, the recursive structure of the multilayer neural 
network in (6.10) lends itself to computationally efficient optimization methods. 


6.4.1 Derivation of the Backpropagation Algorithm 


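The following lemmas, which were previously introduced in Chap. 1, are useful in 
deriving the BP algorithm: 


Lemma 6.2 Let $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{x} \in \mathbb{R}^n$. Then, we have 

$$\frac{\partial \mathbf{A}\mathbf{x}}{\partial \mathrm{VEC}(\mathbf{A})} = \mathbf{x} \otimes \mathbf{I}_m. \tag{6.23}$$


Lemma 6.3 For the vectors $\mathbf{x} \in \mathbb{R}^m$, $\mathbf{y} \in \mathbb{R}^n$, we have 

$$\mathrm{VEC}(\mathbf{x}\mathbf{y}^\top) = (\mathbf{y} \otimes \mathbf{I}_m)\mathbf{x}, \tag{6.24}$$

where $\mathbf{I}_m$ denotes the m x m identity matrix. 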
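Lemma 6.3 can be verified on a small example. The sketch below assumes that VEC stacks the columns of a matrix (column-major order) and implements the Kronecker product with plain lists; the vectors are arbitrary illustrative values.

```python
def kron(A, B):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def vec(A):
    """Column-major vectorization: stack the columns of A on top of each other."""
    rows, cols = len(A), len(A[0])
    return [A[i][j] for j in range(cols) for i in range(rows)]


def matvec(A, x):
    """Matrix-vector product."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def identity(m):
    """The m x m identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]


x = [1.0, 2.0, 3.0]                                  # x in R^m with m = 3
y = [4.0, 5.0]                                       # y in R^n with n = 2
outer = [[xi * yj for yj in y] for xi in x]          # x y^T, an m x n matrix
y_column = [[yj] for yj in y]                        # y as an n x 1 matrix

lhs = vec(outer)                                     # VEC(x y^T)
rhs = matvec(kron(y_column, identity(len(x))), x)    # (y kron I_m) x
print(lhs, rhs)
```

Both sides list the products of the entries of y and x with the index of y varying slowest, which is why the column-stacking convention for VEC matters here.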


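For the derivation of the backpropagation algorithm, we tentatively assume that 
the bias terms are zero, i.e. $\mathbf{b}^{(l)} = \mathbf{0}$, $l = 1, \cdots, L$. In this case, 
the neural network parameter $\boldsymbol{\Theta}$ in (6.14) can be simplified as 

$$\boldsymbol{\Theta} = \begin{bmatrix} \mathbf{W}^{(1)} \\ \vdots \\ \mathbf{W}^{(L)} \end{bmatrix}, \tag{6.25}$$

where $\mathbf{W}^{(l)} \in \mathbb{R}^{d^{(l)} \times d^{(l-1)}}$. Using the denominator layout as 
explained in Chap. 1, we have 

$$\frac{\partial c}{\partial \boldsymbol{\Theta}} = \begin{bmatrix} \frac{\partial c}{\partial \mathbf{W}^{(1)}} \\ \vdots \\ \frac{\partial c}{\partial \mathbf{W}^{(L)}} \end{bmatrix}, \tag{6.26}$$

so that the weight at the l-th layer can be updated with the increment: 

$$\Delta\boldsymbol{\Theta} = \begin{bmatrix} \Delta\mathbf{W}^{(1)} \\ \vdots \\ \Delta\mathbf{W}^{(L)} \end{bmatrix}, \quad \text{where} \quad \Delta\mathbf{W}^{(l)} = -\eta \frac{\partial c}{\partial \mathbf{W}^{(l)}}. \tag{6.27}$$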

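Therefore, $\partial c / \partial \mathbf{W}^{(l)}$ should be specified. More specifically, for a 
given training data set $\{\mathbf{x}_n, \mathbf{y}_n\}_{n=1}^N$, recall that the cost function 
$c(\boldsymbol{\Theta})$ in (6.12) is given by 

$$c(\boldsymbol{\Theta}) = \sum_{n=1}^{N} \ell\left(\mathbf{y}_n, f_{\boldsymbol{\Theta}}(\mathbf{x}_n)\right), \tag{6.28}$$

where $f_{\boldsymbol{\Theta}}(\mathbf{x}_n)$ is defined in (6.13). Now define the l-th layer 
variable with respect to the n-th training data: 

$$\mathbf{o}_n^{(l)} = \boldsymbol{\sigma}(\mathbf{g}_n^{(l)}), \quad \mathbf{g}_n^{(l)} = \mathbf{W}^{(l)} \mathbf{o}_n^{(l-1)}, \tag{6.29}$$

for $l = 1, \cdots, L$, with the initialization 

$$\mathbf{o}_n^{(0)} := \mathbf{x}_n, \tag{6.30}$$

where the bias is assumed zero. Then, we have 

$$\mathbf{o}_n^{(L)} = f_{\boldsymbol{\Theta}}(\mathbf{x}_n).$$

Using the chain rule for the denominator convention (see Eq. (1.40)) 

$$\frac{\partial c(\mathbf{g}(\mathbf{u}))}{\partial \mathbf{x}} = \frac{\partial \mathbf{u}}{\partial \mathbf{x}} \frac{\partial \mathbf{g}(\mathbf{u})}{\partial \mathbf{u}} \frac{\partial c(\mathbf{g})}{\partial \mathbf{g}}, \tag{6.31}$$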
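we have 

$$\frac{\partial c}{\partial \mathrm{VEC}(\mathbf{W}^{(l)})} = \sum_{n=1}^{N} \frac{\partial \mathbf{g}_n^{(l)}}{\partial \mathrm{VEC}(\mathbf{W}^{(l)})} \frac{\partial \ell\left(\mathbf{y}_n, \mathbf{o}_n^{(L)}\right)}{\partial \mathbf{g}_n^{(l)}}.$$

Furthermore, Lemma 6.2 informs us 

$$\frac{\partial \mathbf{g}_n^{(l)}}{\partial \mathrm{VEC}(\mathbf{W}^{(l)})} = \frac{\partial \mathbf{W}^{(l)} \mathbf{o}_n^{(l-1)}}{\partial \mathrm{VEC}(\mathbf{W}^{(l)})} = \mathbf{o}_n^{(l-1)} \otimes \mathbf{I}_{d^{(l)}}. \tag{6.32}$$

We further define the term: 

$$\boldsymbol{\delta}_n^{(l)} := \frac{\partial \ell\left(\mathbf{y}_n, \mathbf{o}_n^{(L)}\right)}{\partial \mathbf{g}_n^{(l)}}, \tag{6.33}$$

which can be calculated using the chain rule (6.31) as follows: 

$$\boldsymbol{\delta}_n^{(l)} = \frac{\partial \mathbf{o}_n^{(l)}}{\partial \mathbf{g}_n^{(l)}} \frac{\partial \mathbf{g}_n^{(l+1)}}{\partial \mathbf{o}_n^{(l)}} \cdots \frac{\partial \mathbf{g}_n^{(L)}}{\partial \mathbf{o}_n^{(L-1)}} \frac{\partial \mathbf{o}_n^{(L)}}{\partial \mathbf{g}_n^{(L)}} \frac{\partial \ell\left(\mathbf{y}_n, \mathbf{o}_n^{(L)}\right)}{\partial \mathbf{o}_n^{(L)}} = \boldsymbol{\Lambda}_n^{(l)} \mathbf{W}^{(l+1)\top} \boldsymbol{\Lambda}_n^{(l+1)} \mathbf{W}^{(l+2)\top} \cdots \mathbf{W}^{(L)\top} \boldsymbol{\Lambda}_n^{(L)} \boldsymbol{\epsilon}_n, \tag{6.34}$$

for $l = 1, \cdots, L$, and the error term $\boldsymbol{\epsilon}_n$ is computed by 

$$\boldsymbol{\epsilon}_n = \frac{\partial \ell\left(\mathbf{y}_n, \mathbf{o}_n^{(L)}\right)}{\partial \mathbf{o}_n^{(L)}}.$$

In (6.34), we use 

$$\frac{\partial \mathbf{o}_n^{(l)}}{\partial \mathbf{g}_n^{(l)}} = \frac{\partial \boldsymbol{\sigma}(\mathbf{g}_n^{(l)})}{\partial \mathbf{g}_n^{(l)}} := \boldsymbol{\Lambda}_n^{(l)} \in \mathbb{R}^{d^{(l)} \times d^{(l)}}, \tag{6.35}$$

which is calculated using the denominator layout as explained in Chap. 1, and 

$$\frac{\partial \mathbf{g}_n^{(l+1)}}{\partial \mathbf{o}_n^{(l)}} = \frac{\partial \mathbf{W}^{(l+1)} \mathbf{o}_n^{(l)}}{\partial \mathbf{o}_n^{(l)}} = \mathbf{W}^{(l+1)\top}, \tag{6.36}$$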
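which is obtained using the denominator convention (see (1.41) in Chap. 1). 
Accordingly, we have 

$$\frac{\partial c}{\partial \mathrm{VEC}(\mathbf{W}^{(l)})} = \sum_{n=1}^{N} \frac{\partial \mathbf{g}_n^{(l)}}{\partial \mathrm{VEC}(\mathbf{W}^{(l)})} \frac{\partial \ell\left(\mathbf{y}_n, \mathbf{o}_n^{(L)}\right)}{\partial \mathbf{g}_n^{(l)}} = \sum_{n=1}^{N} \left(\mathbf{o}_n^{(l-1)} \otimes \mathbf{I}_{d^{(l)}}\right) \boldsymbol{\delta}_n^{(l)} = \sum_{n=1}^{N} \mathrm{VEC}\left(\boldsymbol{\delta}_n^{(l)} \mathbf{o}_n^{(l-1)\top}\right),$$

where we use (6.32) and (6.33) for the second equality, and Lemma 6.3 for the last 
equality. Finally, we have the following derivative of the cost with respect to 
$\mathbf{W}^{(l)}$: 

$$\frac{\partial c}{\partial \mathbf{W}^{(l)}} = \mathrm{UNVEC}\left(\frac{\partial c}{\partial \mathrm{VEC}(\mathbf{W}^{(l)})}\right) = \mathrm{UNVEC}\left(\sum_{n=1}^{N} \mathrm{VEC}\left(\boldsymbol{\delta}_n^{(l)} \mathbf{o}_n^{(l-1)\top}\right)\right) = \sum_{n=1}^{N} \boldsymbol{\delta}_n^{(l)} \mathbf{o}_n^{(l-1)\top},$$

where we use the linearity of the UNVEC(·) operator for the last equality. Therefore, the 
weight update increment is given by 

$$\Delta\mathbf{W}^{(l)} = -\eta \frac{\partial c}{\partial \mathbf{W}^{(l)}} = -\eta \sum_{n=1}^{N} \boldsymbol{\delta}_n^{(l)} \mathbf{o}_n^{(l-1)\top}. \tag{6.37}$$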


6.4.2 Geometrical Interpretation of BP Algorithm 


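This weight update scheme in (6.37) is the key in BP. Not only is the final form of 
the weight update in (6.37) very concise, but it also has a very important geometric 
meaning, which deserves further discussion. In particular, the update is totally 
determined by the outer product of the two terms $\boldsymbol{\delta}_n^{(l)}$ and 
$\mathbf{o}_n^{(l-1)}$, i.e. $\boldsymbol{\delta}_n^{(l)} \mathbf{o}_n^{(l-1)\top}$. 
Why are these terms so important? This is the main discussion point in this section. 

First, recall that $\mathbf{o}_n^{(l-1)}$ is the (l − 1)-th layer neural network output given by 
(6.29). Since this term is calculated in the forward path of the neural network, it is 
nothing but the forward-propagated input to the l-th layer neuron. Second, recall 
that 

$$\boldsymbol{\epsilon}_n = \frac{\partial \ell\left(\mathbf{y}_n, \mathbf{o}_n^{(L)}\right)}{\partial \mathbf{o}_n^{(L)}}.$$

If we use the $\ell_2$ loss, this term becomes 

$$\boldsymbol{\epsilon}_n = \frac{\partial \left(\frac{1}{2}\|\mathbf{y}_n - \mathbf{o}_n^{(L)}\|^2\right)}{\partial \mathbf{o}_n^{(L)}} = \mathbf{o}_n^{(L)} - \mathbf{y}_n,$$

which is indeed the estimation error of the neural network output. Since we have 

$$\boldsymbol{\delta}_n^{(l)} = \boldsymbol{\Lambda}_n^{(l)} \mathbf{W}^{(l+1)\top} \boldsymbol{\Lambda}_n^{(l+1)} \mathbf{W}^{(l+2)\top} \cdots \mathbf{W}^{(L)\top} \boldsymbol{\Lambda}_n^{(L)} \boldsymbol{\epsilon}_n, \tag{6.38}$$

this implies that $\boldsymbol{\delta}_n^{(l)}$ is indeed the backward-propagated estimation error 
down to the l-th layer. Therefore, we can find that the weight update is determined by the 
outer product of the forward-propagated input and backward-propagated estimation 
error. 

In terms of calculation, the forward and backward terms $\mathbf{o}_n^{(l-1)}$ and 
$\boldsymbol{\delta}_n^{(l)}$ can be efficiently calculated using recursive formulae. More 
specifically, we have 

$$\mathbf{o}_n^{(l)} = \boldsymbol{\sigma}\left(\mathbf{W}^{(l)} \mathbf{o}_n^{(l-1)}\right), \tag{6.39}$$

$$\boldsymbol{\delta}_n^{(l)} = \boldsymbol{\Lambda}_n^{(l)} \mathbf{W}^{(l+1)\top} \boldsymbol{\delta}_n^{(l+1)}, \tag{6.40}$$

with the initialization by 

$$\mathbf{o}_n^{(0)} = \mathbf{x}_n, \quad \boldsymbol{\delta}_n^{(L)} = \boldsymbol{\epsilon}_n. \tag{6.41}$$

Strictly speaking, (6.38) with l = L gives 
$\boldsymbol{\delta}_n^{(L)} = \boldsymbol{\Lambda}_n^{(L)} \boldsymbol{\epsilon}_n$; the initialization in 
(6.41) corresponds to $\boldsymbol{\Lambda}_n^{(L)} = \mathbf{I}$, e.g. when all output neurons 
are active. 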

The geometric interpretation and recursive formulae are illustrated in Fig. 6.8. 
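The recursions (6.39) and (6.40) can be sketched for a single training sample with ReLU activations and the l2 loss. The example below is an illustrative sketch, not the book's implementation: the 2-3-2 network and data are hypothetical, biases are zero as assumed in the derivation, and the backward pass starts from the general form given by (6.34) at l = L, i.e. the error term masked by the last layer's activation pattern. A central finite-difference check is included as a sanity test.

```python
def matvec(W, v):
    """Matrix-vector product W v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def matvec_t(W, v):
    """Compute W^T v without forming the transpose."""
    return [sum(W[i][j] * v[i] for i in range(len(W))) for j in range(len(W[0]))]


def forward(weights, x):
    """(6.39): o^(l) = ReLU(W^(l) o^(l-1)) with o^(0) = x and zero biases."""
    outputs, hidden = [list(x)], []
    for W in weights:
        g = matvec(W, outputs[-1])
        hidden.append(g)
        outputs.append([max(0.0, v) for v in g])
    return outputs, hidden


def loss(weights, x, y):
    """l2 loss 0.5 * ||y - o^(L)||^2 for one training sample."""
    out = forward(weights, x)[0][-1]
    return 0.5 * sum((yi - oi) ** 2 for yi, oi in zip(y, out))


def backprop(weights, x, y):
    """Return the gradients dc/dW^(l) = delta^(l) o^(l-1)^T for every layer."""
    outputs, hidden = forward(weights, x)
    eps = [oi - yi for oi, yi in zip(outputs[-1], y)]               # error term
    delta = [e if g > 0 else 0.0 for e, g in zip(eps, hidden[-1])]  # Lambda^(L) eps
    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):    # weights[l] stores W^(l+1)
        grads[l] = [[d * o for o in outputs[l]] for d in delta]     # outer product
        if l > 0:
            back = matvec_t(weights[l], delta)                      # W^T delta
            delta = [b if g > 0 else 0.0 for b, g in zip(back, hidden[l - 1])]
    return grads


def numerical_grad(weights, x, y, l, i, j, h=1e-6):
    """Central finite difference of the loss with respect to one weight entry."""
    original = weights[l][i][j]
    weights[l][i][j] = original + h
    up = loss(weights, x, y)
    weights[l][i][j] = original - h
    down = loss(weights, x, y)
    weights[l][i][j] = original
    return (up - down) / (2 * h)


# Hypothetical 2-3-2 network and one training pair.
weights = [
    [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]],
    [[1.0, 0.0, -1.0], [0.0, 2.0, 1.0]],
]
x, y = [1.0, 2.0], [1.0, 2.0]

grads = backprop(weights, x, y)
max_err = max(
    abs(grads[l][i][j] - numerical_grad(weights, x, y, l, i, j))
    for l in range(len(weights))
    for i in range(len(weights[l]))
    for j in range(len(weights[l][0]))
)
print(grads[0], grads[1], max_err)
```

Each gradient is an outer product of a back-propagated error and a forward-propagated input, so rows that belong to inactive ReLUs are zero, matching the geometric picture in Fig. 6.8.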


6.4.3 Variational Interpretation of BP Algorithm 


The variational principle is a scientific principle used within the calculus of 
variations [29], which develops general methods for finding functions that minimize 
the value of quantities that depend upon those functions. The calculus of variations 
is a field of mathematical analysis pioneered by Isaac Newton, which uses variations 
to reduce the energy function [29]. 

Fig. 6.8 Geometry of backpropagation (panel labels: forward propagation of input, 
backward propagation of error) 

Given the incremental variation in (6.37), we are therefore interested in finding 
whether it indeed reduces the energy function. For this, let us consider a simplified 
form of the loss function with the $\ell_2$ loss and N = 1. In the following, we show that 
for the case of neural networks with ReLU activation functions, the BP algorithm is 
indeed equivalent to the variational approach. 

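More specifically, let the baseline energy function, which refers to the cost 
function before the perturbation, be given by 

$$\ell(\mathbf{y}, \mathbf{o}^{(L)}) := \frac{1}{2}\|\mathbf{y} - \mathbf{o}^{(L)}\|^2, \tag{6.42}$$

where the subscript n for the training data index is neglected here for simplicity and 

$$\mathbf{o}^{(L)} := \boldsymbol{\sigma}\left(\mathbf{W}^{(L)} \mathbf{o}^{(L-1)}\right). \tag{6.43}$$

One of the important observations is that for the case of the ReLU, (6.43) can be 
represented by 

$$\mathbf{o}^{(l)} := \boldsymbol{\Gamma}^{(l)} \mathbf{g}^{(l)}, \quad \text{where} \quad \mathbf{g}^{(l)} := \mathbf{W}^{(l)} \mathbf{o}^{(l-1)}, \tag{6.44}$$

where $\boldsymbol{\Gamma}^{(l)} \in \mathbb{R}^{d^{(l)} \times d^{(l)}}$ is a diagonal matrix with 0 and 1 
values given by 

$$\boldsymbol{\Gamma}^{(l)} = \begin{bmatrix} \gamma_1 & 0 & \cdots & 0 \\ 0 & \gamma_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \gamma_{d^{(l)}} \end{bmatrix}, \tag{6.45}$$

where 

$$\gamma_j = \gamma\left([\mathbf{g}^{(l)}]_j\right), \tag{6.46}$$

where $[\mathbf{g}^{(l)}]_j$ denotes the j-th element of the vector $\mathbf{g}^{(l)}$ and 
$\gamma(\cdot)$ is defined in (6.4). Thanks to (6.5), we have 

$$\boldsymbol{\Gamma}^{(l)} = \boldsymbol{\Lambda}^{(l)}, \quad l = 1, \cdots, L, \tag{6.47}$$

where $\boldsymbol{\Lambda}^{(l)}$ is defined as the derivative of the activation function in 
(6.35). Therefore, using the recursive formula, we have 

$$\mathbf{o}^{(L)} = \boldsymbol{\Lambda}^{(L)} \mathbf{W}^{(L)} \cdots \boldsymbol{\Lambda}^{(l)} \mathbf{W}^{(l)} \mathbf{o}^{(l-1)}. \tag{6.48}$$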


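Using this, we now investigate whether the cost decreases with the perturbed 
weight 

$$\Delta\mathbf{W}^{(l)} = -\eta \boldsymbol{\delta}^{(l)} \mathbf{o}^{(l-1)\top}. \tag{6.49}$$

When the step size $\eta$ is sufficiently small, the ReLU activation patterns from 
$\mathbf{W}^{(l)} + \Delta\mathbf{W}^{(l)}$ do not change from those by $\mathbf{W}^{(l)}$ (this issue 
will be discussed later), so that the new cost function value is given by 

$$\hat{\ell}(\mathbf{y}, \mathbf{o}^{(L)}) := \frac{1}{2}\left\|\mathbf{y} - \boldsymbol{\Lambda}^{(L)} \mathbf{W}^{(L)} \cdots \boldsymbol{\Lambda}^{(l)} \left(\mathbf{W}^{(l)} + \Delta\mathbf{W}^{(l)}\right) \mathbf{o}^{(l-1)}\right\|^2.$$

Recall that we have 

$$\boldsymbol{\delta}^{(L)} = \mathbf{o}^{(L)} - \mathbf{y} = \boldsymbol{\Lambda}^{(L)} \mathbf{W}^{(L)} \cdots \boldsymbol{\Lambda}^{(l)} \mathbf{W}^{(l)} \mathbf{o}^{(l-1)} - \mathbf{y}.$$

Accordingly, we have 

$$\hat{\ell}(\mathbf{y}, \mathbf{o}^{(L)}) = \frac{1}{2}\left\|-\boldsymbol{\delta}^{(L)} - \boldsymbol{\Lambda}^{(L)} \mathbf{W}^{(L)} \cdots \boldsymbol{\Lambda}^{(l)} \Delta\mathbf{W}^{(l)} \mathbf{o}^{(l-1)}\right\|^2 \tag{6.50}$$

$$= \frac{1}{2}\left\|-\boldsymbol{\delta}^{(L)} + \eta \boldsymbol{\Lambda}^{(L)} \mathbf{W}^{(L)} \cdots \boldsymbol{\Lambda}^{(l)} \boldsymbol{\delta}^{(l)} \mathbf{o}^{(l-1)\top} \mathbf{o}^{(l-1)}\right\|^2$$

$$= \frac{1}{2}\left\|\left(\mathbf{I} - \eta \|\mathbf{o}^{(l-1)}\|^2 \mathbf{M}^{(l)}\right) \boldsymbol{\delta}^{(L)}\right\|^2,$$

where we use $\|\mathbf{o}^{(l-1)}\|^2 = \mathbf{o}^{(l-1)\top} \mathbf{o}^{(l-1)}$ and 

$$\mathbf{M}^{(l)} = \boldsymbol{\Lambda}^{(L)} \mathbf{W}^{(L)} \cdots \mathbf{W}^{(l+1)} \boldsymbol{\Lambda}^{(l)} \boldsymbol{\Lambda}^{(l)} \mathbf{W}^{(l+1)\top} \cdots \mathbf{W}^{(L)\top} \boldsymbol{\Lambda}^{(L)},$$

which comes from (6.38). Now, we can easily see that for all 
$\mathbf{x} \in \mathbb{R}^{d^{(L)}}$ we have 

$$\mathbf{x}^\top \mathbf{M}^{(l)} \mathbf{x} = \left\|\boldsymbol{\Lambda}^{(l)} \mathbf{W}^{(l+1)\top} \cdots \mathbf{W}^{(L)\top} \boldsymbol{\Lambda}^{(L)} \mathbf{x}\right\|^2 \geq 0, \tag{6.51}$$

so that $\mathbf{M}^{(l)}$ is positive semidefinite, i.e. its eigenvalues are non-negative. 
Furthermore, we have 

$$\left\|\left(\mathbf{I} - \eta \|\mathbf{o}^{(l-1)}\|^2 \mathbf{M}^{(l)}\right) \boldsymbol{\delta}^{(L)}\right\|^2 \leq \lambda_{\max}^2\left(\mathbf{I} - \eta \|\mathbf{o}^{(l-1)}\|^2 \mathbf{M}^{(l)}\right) \times \|\boldsymbol{\delta}^{(L)}\|^2, \tag{6.52}$$

where $\lambda_{\max}(\mathbf{A})$ denotes the largest eigenvalue of $\mathbf{A}$. In addition, we have 

$$\lambda_{\max}^2\left(\mathbf{I} - \eta \|\mathbf{o}^{(l-1)}\|^2 \mathbf{M}^{(l)}\right) = \left(1 - \eta \|\mathbf{o}^{(l-1)}\|^2 \lambda_{\max}(\mathbf{M}^{(l)})\right)^2.$$

Therefore, if the largest eigenvalue satisfies 

$$0 < \lambda_{\max}(\mathbf{M}^{(l)}) < \frac{2}{\eta \|\mathbf{o}^{(l-1)}\|^2}, \tag{6.53}$$

we can show 

$$\hat{\ell}(\mathbf{y}, \mathbf{o}^{(L)}) < \frac{1}{2}\|\boldsymbol{\delta}^{(L)}\|^2 = \ell(\mathbf{y}, \mathbf{o}^{(L)}),$$

so the cost function value decreases with the perturbation. 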

It is important to emphasize that this strong convergence result is due to the 
property of the ReLU in (6.47), which is not shared by saturating activation 
functions such as the sigmoid. This may be another reason for the success of the ReLU 
in modern deep learning. Having said this, care should be taken since this argument is 
true only for a sufficiently small step size $\eta$, so that the ReLU activation patterns 
after the perturbation do not change. In fact, this may be another reason to choose an 
appropriate step size in the optimization algorithm. 
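The role of the step size can be illustrated on a chain of scalar ReLU neurons, where the matrix M reduces to a single number. The sketch below is a hypothetical example with every unit active: it perturbs one weight by the BP update (6.49) and compares the resulting output error with the factor predicted by (6.50). The weights, input, target, and the two step sizes are assumptions chosen so that the activation pattern does not change.

```python
def forward(ws, x):
    """Scalar chain o^(l) = ReLU(w_l * o^(l-1)); returns [o^(0), ..., o^(L)]."""
    outs = [x]
    for w in ws:
        outs.append(max(0.0, w * outs[-1]))
    return outs


def error_after_update(ws, x, y, l, eta):
    """Output error o^(L) - y after perturbing w_l (1-indexed) by the BP update."""
    outs = forward(ws, x)
    assert all(o > 0 for o in outs[1:]), "sketch assumes every unit is active"
    delta = outs[-1] - y                      # error at the output for the l2 loss
    for k in range(len(ws), l, -1):           # back-propagate through layers L, ..., l+1
        delta = ws[k - 1] * delta             # every Lambda equals 1 here
    new_ws = list(ws)
    new_ws[l - 1] = ws[l - 1] - eta * delta * outs[l - 1]    # update (6.49)
    new_outs = forward(new_ws, x)
    assert all(o > 0 for o in new_outs[1:]), "activation pattern changed"
    return new_outs[-1] - y


ws, x, y, l = [2.0, 2.0, 3.0], 1.0, 10.0, 2
outs = forward(ws, x)
err = outs[-1] - y                            # baseline error: 12 - 10 = 2

m = 1.0
for k in range(l + 1, len(ws) + 1):
    m *= ws[k - 1] ** 2                       # scalar version of M^(l)
bound = 2.0 / (outs[l - 1] ** 2 * m)          # step-size threshold from (6.53)

small = error_after_update(ws, x, y, l, eta=0.01)   # below the threshold
large = error_after_update(ws, x, y, l, eta=0.06)   # above the threshold
print(bound, err, small, large)
```

With the smaller step the error shrinks by exactly the predicted factor, while the larger step overshoots and increases the cost even though the activation pattern is unchanged.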


6.4.4 Local Variational Formulation 


Another way of understanding BP is via propagation of the cost function. As shown 
in Fig. 6.8, after the forward and backward propagation of the input and error, 
respectively, the resulting optimization problem for the weight update at the l-th 
layer is given by 

$$\min_{\mathbf{W}} \left\|-\boldsymbol{\delta}^{(l)} - \mathbf{W} \mathbf{o}^{(l-1)}\right\|^2. \tag{6.54}$$

Note that we have a minus sign in front of $\boldsymbol{\delta}^{(l)}$, inspired by its global 
counterpart in (6.50). By inspection, we can easily see that the optimal solution for 
(6.54) is given by 

$$\mathbf{W}^* = -\frac{1}{\|\mathbf{o}^{(l-1)}\|^2} \boldsymbol{\delta}^{(l)} \mathbf{o}^{(l-1)\top}, \tag{6.55}$$

since plugging (6.55) into (6.54) makes the cost function zero. Therefore, the optimal 
search direction for the weight update should be given by 

$$\Delta\mathbf{W}^{(l)} = -\eta \boldsymbol{\delta}^{(l)} \mathbf{o}^{(l-1)\top}, \tag{6.56}$$


which is equivalent to (6.49). The take-away message here is that as long as we can 
obtain the back-propagated error and the forward-propagated input, we can obtain a 
local variational formulation, which can be solved by any means. 
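The claim that (6.55) drives the local cost (6.54) to zero, and that a step along the direction in (6.56) reduces it, is easy to check numerically. In the sketch below the vectors standing in for the back-propagated error and the forward-propagated input are hypothetical values, and the step is taken from a zero weight matrix with an illustrative step size of 0.1.

```python
def matvec(W, v):
    """Matrix-vector product W v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def local_cost(W, delta, o):
    """Local cost in (6.54): || -delta - W o ||^2."""
    return sum((-d - wo) ** 2 for d, wo in zip(delta, matvec(W, o)))


def scaled_outer(scale, delta, o):
    """Return scale * delta o^T as a list of rows."""
    return [[scale * d * oj for oj in o] for d in delta]


delta = [0.0, 8.0, 4.0]      # hypothetical back-propagated error
o = [1.0, 2.0]               # hypothetical forward-propagated input
norm_sq = sum(v * v for v in o)

zero = [[0.0] * len(o) for _ in delta]
W_star = scaled_outer(-1.0 / norm_sq, delta, o)   # closed-form solution (6.55)
W_step = scaled_outer(-0.1, delta, o)             # one step along (6.56) from W = 0

cost_zero = local_cost(zero, delta, o)
cost_step = local_cost(W_step, delta, o)
cost_star = local_cost(W_star, delta, o)
print(cost_zero, cost_step, cost_star)
```

The step along the outer-product direction lowers the local cost, and scaling that same direction by the inverse squared norm of the input removes it entirely, which is the sense in which the BP update solves the local problem.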


6.5 Exercises 


1. Derive the general form of the activation function $\sigma(x)$ that satisfies the 
following differential equation: 

$$\frac{\sigma(x)}{x} = \frac{\partial \sigma(x)}{\partial x}.$$


2. Show that (6.21) is equivalent to (6.20). 
3. Recall that an L-layer feedforward neural network can be represented recursively 
by 

$$\mathbf{o}^{(l)} = \boldsymbol{\sigma}(\mathbf{g}^{(l)}), \quad \mathbf{g}^{(l)} = \mathbf{W}^{(l)} \mathbf{o}^{(l-1)} + \mathbf{b}^{(l)}, \tag{6.57}$$

for $l = 1, \cdots, L$. When the training data size is 1, the weight update is given by 

$$\Delta\mathbf{W}^{(l)} = -\gamma \boldsymbol{\delta}^{(l)} \mathbf{o}^{(l-1)\top}, \tag{6.58}$$

where $\gamma > 0$ is the step size and 

$$\boldsymbol{\delta}^{(l)} = \frac{\partial \ell(\mathbf{y}, \mathbf{o}^{(L)})}{\partial \mathbf{g}^{(l)}}. \tag{6.59}$$

a. Derive the update equation similar to (6.58) for the bias term, i.e. $\Delta\mathbf{b}^{(l)}$. 
b. Suppose the weight matrix $\mathbf{W}^{(l)}$, $l = 1, \cdots, L$, is a diagonal matrix. Draw 
the network connection architecture similarly to Fig. 6.8. Then, derive the 
backprop algorithm for the diagonal term of the weight matrix, assuming that 
the bias is zero. You must use the chain rule to derive this. 


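4. Let a two-layer ReLU neural network $f_{\boldsymbol{\Theta}}$ have an input and output 
dimension for each layer in $\mathbb{R}^2$, i.e. 
$f_{\boldsymbol{\Theta}}: \mathbf{x} \in \mathbb{R}^2 \mapsto f_{\boldsymbol{\Theta}}(\mathbf{x}) \in \mathbb{R}^2$. 
Suppose that the parameter $\boldsymbol{\Theta}$ of the network is composed of weight and bias: 

$$\boldsymbol{\Theta} = \left[\mathbf{W}^{(1)}, \mathbf{W}^{(2)}, \mathbf{b}^{(1)}, \mathbf{b}^{(2)}\right], \tag{6.60}$$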

which are initialized as follows: 


ed ee on 
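Then, for a given $\ell_2$ loss function 

$$\ell(\boldsymbol{\Theta}) = \frac{1}{2}\|\mathbf{y} - f_{\boldsymbol{\Theta}}(\mathbf{x})\|^2 \tag{6.62}$$

and a training data 

$$\mathbf{x} = [1, -1]^\top, \quad \mathbf{y} = [1, 0]^\top, \tag{6.63}$$

compute the weight and bias update for the first two iterations of the backpropagation 
algorithm. It is suggested that the unit step size, i.e. $\gamma = 1$, be used. 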

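5. We are now interested in extending (6.54) for the training data composed of N 
samples. 

a. Show that the following equality holds for the local variational formulation: 

$$\min_{\mathbf{W}} \sum_{n=1}^{N} \left\|-\boldsymbol{\delta}_n^{(l)} - \mathbf{W} \mathbf{o}_n^{(l-1)}\right\|^2 = \min_{\mathbf{W}} \left\|-\boldsymbol{\Delta}^{(l)} - \mathbf{W} \mathbf{O}^{(l-1)}\right\|_F^2, \tag{6.64}$$

where $\|\cdot\|_F$ denotes the Frobenius norm and 

$$\boldsymbol{\Delta}^{(l)} = \left[\boldsymbol{\delta}_1^{(l)} \ \cdots \ \boldsymbol{\delta}_N^{(l)}\right], \quad \mathbf{O}^{(l-1)} = \left[\mathbf{o}_1^{(l-1)} \ \cdots \ \mathbf{o}_N^{(l-1)}\right].$$

b. Show that there exists a step size $\gamma > 0$ such that the weight perturbation 

$$\Delta\mathbf{W}^{(l)} = -\gamma \sum_{n=1}^{N} \boldsymbol{\delta}_n^{(l)} \mathbf{o}_n^{(l-1)\top}$$

reduces the cost value in (6.64). 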
6. Suppose that our activation function is sigmoid. Derive the BP algorithm for 
the L-layer neural network. What is the main difference of the BP algorithm 
compared to the network with a ReLU? Is this an advantage or disadvantage? 
Answer this question in terms of the variational perspective. 

7. Now we are interested in extending the model in (6.6) to a convolutional neural 
network model 

$$o_i^{(l)} = \sigma\left(\sum_{j=1}^{d^{(l-1)}} h_{i-j}^{(l)} o_j^{(l-1)} + b_i^{(l)}\right), \tag{6.65}$$

for $i = 1, \cdots, d^{(l)}$, where $h_i^{(l)}$ is the i-th element of the filter 
$\mathbf{h}^{(l)} = \left[h_1^{(l)}, \cdots, h_p^{(l)}\right]^\top$. 

a. If we want to represent this convolutional neural network in a matrix form, 

$$\mathbf{o}^{(l)} = \boldsymbol{\sigma}\left(\mathbf{W}^{(l)} \mathbf{o}^{(l-1)} + \mathbf{b}^{(l)}\right), \tag{6.66}$$

what is the corresponding weight matrix $\mathbf{W}^{(l)}$? Please show the structure of 
$\mathbf{W}^{(l)}$ explicitly in terms of $\mathbf{h}^{(l)}$ elements. 
b. Derive the backpropagation algorithm for the filter update $\Delta\mathbf{h}^{(l)}$. 


Chapter 7 
Convolutional Neural Networks 


7.1 Introduction 


A convolutional neural network (CNN, or ConvNet) is a class of deep neural 
networks, widely used for analyzing and processing images. Multilayer perceptrons, 
which we discussed in the previous chapter, usually require fully connected 
networks, where each neuron in one layer is connected to all neurons in the next 
layer. Unfortunately, this type of connection inevitably increases the number of 
weights. In CNNs, the number of weights can be significantly reduced using their 
shared-weights architecture, which originates from the translation-invariant 
characteristics of the convolution. 

A convolutional neural network was first developed by Yann LeCun for hand- 
written zip code identification [21], inspired by the famous experiments by Hubel 
and Wiesel for a cat’s primary visual cortex [20]. Recall that Hubel and Wiesel 
found that simple cells in the primary visual cortex of a cat respond best to edge- 
like stimuli at a particular orientation, position, and phase within their relatively 
small receptive fields. Yann LeCun realized that the aggregation of LGN (lateral 
geniculate nucleus) cells with the same receptive field is similar to the convolution 
operation, which led him to construct a neural network as the cascaded applications 
of convolution, nonlinearity, and image subsampling, followed by fully connected 
layers that determine linear hyperplanes in the feature space for the classification 
tasks. The resulting network architecture, shown in Fig. 7.1, is called LeNet [21]. 

While the algorithm worked, training to learn 10 digits required 3 days! Many 
factors contributed to the slow speed, including the vanishing gradient problem, 
which will be discussed later. Therefore, simpler models that use task-specific 
handcrafted features such as support vector machines (SVMs) or kernel machines 
[11] were popular choices in the 1990s and 2000s, because of the artificial neural 
network’s (ANN) computational cost and a lack of understanding of its working 
mechanism. In fact, the lack of understanding of the ANN has been the main 
criticism of many contemporary scientists, including the famous Vladimir Vapnik, 
the inventor of the SVM. In the preface of his classical book entitled The Nature of 
Statistical Learning Theory [10], Vapnik expressed his concern saying that “Among 
artificial intelligence researchers the hardliners had considerable influence (it is 
precisely they who declared that complex theories do not work, simple algorithms 
do)”. 

Fig. 7.1 LeNet: the first CNN proposed by Yann LeCun for zip code identification [21] 

Ironically, the advent of the SVM and kernel machines has led to a long period of 
decline in neural network research, often referred to as the “AI winter”. During the 
AI winter, the neural network researchers were largely considered pseudo-scientists 
and even had difficulty in securing research funding. Although there were
several notable publications on neural networks during the AI winter, the revival of
convolutional neural network research, up to the level of general public acceptance,
had to wait until the series of deep neural network breakthroughs at the ILSVRC
(ImageNet Large Scale Visual Recognition Challenge).

In the following section, we give a brief overview of the history of modern CNN 
research that has contributed to the revival of research on neural networks. 


7.2 History of Modern CNNs 


7.2.1 AlexNet 


ImageNet is a large visual database designed for use in visual object recognition 
software research [8]. ImageNet contains more than 20,000 categories, each consisting of
several hundred images. Since 2010, the ImageNet project has run an annual software
contest, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [7],
where software programs compete to correctly classify and detect objects and 
scenes. Around 2011, a good ILSVRC classification error rate, which was based 
on classical machine learning approaches, was about 27%. 

In the 2012 ImageNet Challenge, Krizhevsky et al. [9] proposed a CNN
architecture, shown in Fig. 7.2, which is now known as AlexNet. The AlexNet
architecture is composed of five convolution layers and three fully connected layers.
In fact, the basic components of AlexNet were nearly the same as those of LeNet by
Yann LeCun [21], except for the new nonlinearity using the rectified linear unit (ReLU).
AlexNet achieved a Top-5 error rate (the rate of not finding the true label of a given image
among its top 5 predictions) of 15.3%. The next best result in the challenge, which
was based on classical kernel machines, trailed far behind (26.2%).

Fig. 7.2 The ImageNet challenges and the CNN winners that have completely changed the
landscape of artificial intelligence

In fact, the celebrated victory of AlexNet declared the start of a “new era” in 
data science, as witnessed by more than 75k citations according to Google Scholar 
as of January 2021. With the introduction of AlexNet, the world was no longer
the same: all the subsequent winners at the ImageNet challenges were deep
neural networks, and nowadays CNNs surpass human observers in ImageNet
classification. In the following, we introduce several subsequent CNN architectures
that have made significant contributions to deep learning research.


7.2.2 GoogLeNet 


GoogLeNet [30] was the winner at the 2014 ILSVRC (see Fig. 7.2). As the name
“GoogLeNet” indicates, it is from Google, but one may wonder why it is not written
as “GoogleNet”. This is because the researchers behind GoogLeNet wanted to pay tribute
to Yann LeCun’s LeNet [21] by including the word “LeNet”.

The network architecture is quite different from AlexNet due to the so-called
inception module [30], shown in Fig. 7.3. Specifically, each inception module
applies convolutions of different sizes and types to the same input and stacks
all the outputs. This idea was inspired by the famous 2010 science fiction
film Inception, in which Leonardo DiCaprio starred. In the film, the renowned
director Christopher Nolan wanted to explore “the idea of people sharing a dream
space... That gives you the ability to access somebody’s unconscious mind.” The
key concept which GoogLeNet borrowed from the film was the “dream within a
dream” strategy, which led to the “network within a network” strategy that improves
the overall performance.

Fig. 7.3 Inception module in GoogLeNet


7.2.3 VGGNet 


VGGNet [31] was invented by the VGG (Visual Geometry Group) from the University
of Oxford for the 2014 ILSVRC (see Fig. 7.2). Although VGGNet was not the
winner of the 2014 ILSVRC (GoogLeNet was the winner at that time, and
VGGNet came second), VGGNet has made a prolonged impact in the machine
learning community due to its modular and simple architecture, which nevertheless yields
a significant performance improvement over AlexNet [9]. In fact, the pretrained
VGGNet model captures many important image features; therefore, it is still widely 
used for various purposes such as perceptual loss [32], etc. Later we will use 
VGGNet to visualize CNNs. 

As shown in Fig. 7.2, VGGNet is composed of multiple layers of convolution, 
max pooling, the ReLU, followed by fully connected layers and softmax. One of 
the most important observations of VGGNet is that it achieves an improvement 
over AlexNet by replacing large kernel-sized filters with multiple 3 x 3 kernel- 
sized filters. As will be shown later, for a given receptive field size, cascaded 
application of a smaller size kernel followed by the ReLU makes the neural network 
more expressive than one with a larger kernel size. This is why VGGNet provided 
significantly improved performance over AlexNet despite its simple structure. 
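
As a rough sanity check of this observation, the following sketch compares stacked 3 × 3 layers with a single larger kernel in terms of receptive field and weight count. The function names and the channel count of 64 are illustrative assumptions (they are not taken from the VGGNet paper), stride 1 is assumed, and biases are ignored.

```python
def stacked_receptive_field(kernel_size: int, num_layers: int) -> int:
    """Receptive field of num_layers stride-1 convolutions with the same kernel size."""
    return 1 + num_layers * (kernel_size - 1)


def conv_weight_count(kernel_size: int, channels: int, num_layers: int = 1) -> int:
    """Number of weights (no biases) when every layer maps channels -> channels."""
    return num_layers * kernel_size * kernel_size * channels * channels


C = 64  # assumed channel count, for illustration only

# Two cascaded 3 x 3 layers versus one 5 x 5 layer.
print(stacked_receptive_field(3, 2), conv_weight_count(3, C, 2))
print(stacked_receptive_field(5, 1), conv_weight_count(5, C, 1))
```

Under these assumptions both configurations see a 5 × 5 input window, but the cascaded version uses fewer weights and applies one more ReLU in between.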


7.2.4 ResNet


In the history of ILSVRC, the Residual Network (ResNet) [33] is considered another 
masterpiece, as shown in its citation record of more than 68k as of January 2020. 

Since the representation power of a deep neural network increases with the 
network depth, there has been strong research interest in increasing the network 
depth. For example, AlexNet [9] from the 2012 ILSVRC had only five convolutional
layers, while the VGG network [31] and GoogLeNet [30] from the 2014 ILSVRC
had 19 and 22 layers, respectively. However, people soon realized that a deeper
neural network is hard to train. This is because of the vanishing gradient problem:
the gradient is easily back-propagated to layers close to the output,
but is difficult to back-propagate to layers far from the output, since the repeated
multiplication may make the gradient very small. As discussed in the previous chapter,
the ReLU nonlinearity partly mitigates the problem, since the forward and backward 
propagation are symmetric, but still the deep neural network turns out to be difficult 
to train due to an unfavorable optimization landscape [34]; this issue will be 
reviewed later. 

As shown in Fig. 7.2, there exist bypass (or skip) connections in the ResNet, 
representing an identity mapping. The bypass connection was proposed to promote 
the gradient back-propagation. Thanks to the skip connection, ResNet makes it 
possible to train up to hundreds or even thousands of layers, achieving a significant 
performance improvement. Recent research reveals that the bypass connection
also improves the forward propagation, making the representation more expressive 
[35]. Furthermore, its optimization landscape can be significantly improved thanks 
to bypass connections that eliminate many local minimizers [35, 36]. 


7.2.5 DenseNet 


DenseNet (Dense Convolutional Network) [37] exploits the extreme form of skip 
connection as shown in Fig. 7.4. In DenseNet, each layer has skip
connections from all preceding layers to obtain additional inputs.

Since each layer receives inputs from all preceding layers, the representation 
power of the network increases significantly, which makes the network compact, 
thereby reducing the number of channels. With dense connections, the authors 
demonstrated that fewer parameters and higher accuracy are achieved compared 
to ResNet [37].

Fig. 7.4 Architecture of DenseNet

Fig. 7.5 Architecture of U-Net


7.2.6 U-Net 


Unlike the aforementioned networks that are designed for the ImageNet classification
task, the U-Net architecture [38] in Fig. 7.5 was originally proposed for biomedical
image segmentation, and is widely used for inverse problems [39, 40].

One of the unique aspects of U-Net is its symmetric encoder-decoder architecture.
The encoder part consists of 3 × 3 convolution, batch normalization [41], and
the ReLU. In the decoder part, upsampling and 3 × 3 convolution are used. Also,
there are max pooling layers and skip connections through channel concatenation.

The multi-scale architecture of U-Net significantly increases the receptive field, 
which may be the main reason for the success of U-Net for segmentation, inverse 
problems, etc., where global information from all over the images is necessary to 
update the local image information. This issue will be discussed later. Moreover, 
the skip connection is important to retain the high-frequency content of the input 
signal.

The symmetric and multi-scale architecture of U-Net inspired many signal
processing discoveries [42], providing important insights into understanding the 
geometry of deep neural networks. 


7.3 Basic Building Blocks of CNNs 


Although the aforementioned CNN architectures appear complicated, a closer look 
at them reveals that they are nothing but cascaded combinations of simple building 
blocks such as convolution, pooling/unpooling, ReLU, etc. These components are 
even considered as basic or “primitive” tools in signal processing. In fact, the 
emergence of the superior performance from the combination of the basic tools is 
one of the mysteries of deep neural networks, which will be discussed extensively 
later. In the meantime, this section provides a detailed explanation of the basic
building blocks of CNNs. 


7.3.1 Convolution 


The convolution is an operation that originates from fundamental properties of linear
time invariant (LTI) or linear spatially invariant (LSI) systems. Specifically, for a
given LSI system, let h denote the impulse response; then the output image y with
respect to the input image x can be computed by


$$y = h \ast x, \qquad (7.1)$$


where $\ast$ denotes the convolution operation. For example, the 3 × 3 convolution case
for 2-D images can be represented element by element as follows:


$$y[m, n] = \sum_{p,q=-1}^{1} h[p, q] \, x[m - p, n - q], \qquad (7.2)$$


where y[m, n], h[m, n] and x[m, n] denote the (m, n)-element of the matrices Y, H 
and X, respectively. One example of computing this convolution is illustrated in 
Fig. 7.6, where the filter is already flipped for visualization. 
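
To make the element-by-element form (7.2) concrete, the following NumPy sketch evaluates it directly. The function name `conv2d_3x3`, the zero boundary condition, and the storage convention for the filter taps are assumptions of this sketch rather than part of the text.

```python
import numpy as np


def conv2d_3x3(x: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Evaluate y[m, n] = sum_{p,q=-1}^{1} h[p, q] x[m - p, n - q] with zero padding.

    h is a 3 x 3 array whose entry h[p + 1, q + 1] stores the filter tap h[p, q].
    """
    rows, cols = x.shape
    padded = np.pad(x, 1)  # zero boundary: an assumption of this sketch
    y = np.zeros((rows, cols), dtype=float)
    for p in (-1, 0, 1):
        for q in (-1, 0, 1):
            # padded[1 + m - p, 1 + n - q] equals x[m - p, n - q]
            shifted = padded[1 - p:1 - p + rows, 1 - q:1 - q + cols]
            y += h[p + 1, q + 1] * shifted
    return y


x = np.arange(12, dtype=float).reshape(3, 4)
delta = np.zeros((3, 3))
delta[1, 1] = 1.0  # h[0, 0] = 1: the identity filter
print(np.array_equal(conv2d_3x3(x, delta), x))  # True
```

Note that deep learning libraries often implement cross-correlation (no filter flip) under the name convolution, so results may differ from this sketch by a flip of the kernel.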

It is important to note that the convolution used in CNNs is richer than the
simple convolution in (7.1) and Fig. 7.6. For example, a three-channel input signal
can generate a single-channel output as shown in Fig. 7.7a, which is often referred
to as multi-input single-output (MISO) convolution. In another example shown in
Fig. 7.7b, a 5 × 5 filter kernel is used to generate 6 (resp. 10) output channels from
3 (resp. 6) input channels. This is often called the multi-input multi-output
(MIMO) convolution. Finally, in Fig. 7.7c, the 1 × 1 filter kernel is used to generate
32 output channels from 64 input channels.

Fig. 7.6 An example of convolution with 3 × 3 filter

All these seemingly different convolutional operations can be written in a general 
MIMO convolution form: 


$$y_i = \sum_{j=1}^{c_{in}} h_{i,j} \ast x_j, \quad i = 1, \cdots, c_{out}, \qquad (7.3)$$


where $c_{in}$ and $c_{out}$ denote the number of input and output channels, respectively,
$x_j$ and $y_i$ refer to the j-th input and the i-th output channel image, respectively, and $h_{i,j}$
is the convolution kernel that contributes to the i-th channel output by convolving
with the j-th input channel image. For the case of 1 × 1 convolution, the filter
kernel becomes


$$h_{i,j} = w_{ij} \, \delta[0, 0],$$


so that (7.3) becomes the weighted sum of input channel images as follows:


$$y_i = \sum_{j=1}^{c_{in}} w_{ij} x_j, \quad i = 1, \cdots, c_{out}. \qquad (7.4)$$
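
The weighted-sum form (7.4) can be checked numerically. In the following sketch, the function name `conv1x1`, the array layout (channels first), and the 8 × 8 image size are illustrative assumptions; the channel counts follow the 64-to-32 example of Fig. 7.7c.

```python
import numpy as np


def conv1x1(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Apply (7.4): y_i = sum_j w[i, j] * x_j.

    x has shape (c_in, H, W); w has shape (c_out, c_in); the result is (c_out, H, W).
    """
    return np.einsum("ij,jhw->ihw", w, x)


rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8, 8))  # 64 input channels
w = rng.standard_normal((32, 64))    # one weight per (output, input) channel pair
y = conv1x1(x, w)
print(y.shape)  # (32, 8, 8)
```

Seen this way, a 1 × 1 convolution is a matrix multiplication along the channel axis applied independently at every pixel.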


7.3.2 Pooling and Unpooling 


A pooling layer is used to progressively reduce the spatial size of the representation
to reduce the number of parameters and the amount of computation in the network. The
pooling layer operates on each feature map independently. The most common
approaches used in pooling are max pooling and average pooling as shown in
Fig. 7.8b. In this case, the pooling layer will always reduce the size of each feature
map by a factor of 2. For example, a max (average) pooling layer in Fig. 7.8b applied
to an input image of 16 × 16 produces an output pooled feature map of 8 × 8.

Fig. 7.7 Various convolutions used in CNNs. (a) Multi-input single-output (MISO) convolution,
(b) Multi-input multi-output (MIMO) convolution, (c) 1 × 1 convolution

Fig. 7.8 (a) Pooling and unpooling operation, (b) max and average pooling operation
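
The 16 × 16 to 8 × 8 reduction can be reproduced with a few lines of NumPy. This is a minimal sketch assuming non-overlapping 2 × 2 windows and even image dimensions; the function name `pool2x2` is hypothetical.

```python
import numpy as np


def pool2x2(x: np.ndarray, mode: str = "max") -> np.ndarray:
    """Non-overlapping 2 x 2 pooling of a single feature map with even height and width."""
    rows, cols = x.shape
    # blocks[i, a, j, b] == x[2 * i + a, 2 * j + b]
    blocks = x.reshape(rows // 2, 2, cols // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "average":
        return blocks.mean(axis=(1, 3))
    raise ValueError(f"unknown mode: {mode}")


x = np.arange(16 * 16, dtype=float).reshape(16, 16)
print(pool2x2(x, "max").shape)      # (8, 8)
print(pool2x2(x, "average").shape)  # (8, 8)
```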

On the other hand, unpooling is an operation for image upsampling. For example,
in the narrow sense of unpooling with respect to max pooling, one can copy the
max pooled signal to its original location as shown in Fig. 7.9a. Alternatively, one could
perform a transpose operation to copy the pooled signal to the whole enlarged area
as shown in Fig. 7.9b, which is often called deconvolution. Regardless of the
definition, unpooling tries to enlarge the downsampled image.

It was believed that a pooling layer is necessary to impose the spatial invariance 
in classification tasks [43]. The main ground for this claim is that small movements 
in the position of the feature in the input image will result in a different feature 
map after the convolution operation, so that spatially invariant object classification 
may be difficult. Therefore, downsampling to a lower resolution version of an input 
signal without the fine detail may be useful for the classification task by imposing 
invariance to translation.

Fig. 7.9 Two ways of unpooling. (a) Copying to the original location (unpooling), (b) copying to
the whole neighborhood (deconvolution)


However, these classical views have been challenged even by the deep learning
godfather, Geoffrey Hinton. In an “Ask Me Anything” session on Reddit, he said, “the
pooling operation used in convolutional neural networks is a big mistake and the
fact that it works so well is a disaster. If the pools do not overlap, pooling loses 
valuable information about where things are. We need this information to detect 
precise relationships between the parts of an object...”. 

Regardless of Geoffrey Hinton’s controversial comment, the undeniable advantage
of the pooling layer results from the increased size of the receptive field.
For example, in Fig. 7.10a,b we compare the effective receptive field sizes, which
determine the area of the input image affecting a specific point of the output image,
of a single resolution network and U-Net, respectively. We can clearly see that
the receptive field size increases linearly without pooling, but can be expanded
exponentially with the help of a pooling layer. In many computer vision tasks, a
large receptive field size is useful to achieve better performance, so pooling and
unpooling are very effective in these applications.
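
The linear versus exponential growth can be illustrated with a small calculation. The sketch below uses the standard recursion for the receptive field of a layer stack along one axis; the helper name `receptive_field` and the particular stack (3 × 3 convolutions, optionally followed by 2 × 2 pooling with stride 2) are assumptions chosen for illustration, and only an encoder-like path is modeled, not the full U-Net.

```python
def receptive_field(layers):
    """Receptive field (in input pixels, along one axis) of a stack of layers.

    layers is a list of (kernel_size, stride) pairs, ordered from input to output.
    """
    field, jump = 1, 1  # jump = spacing of adjacent outputs measured in input pixels
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


conv = (3, 1)  # 3 x 3 convolution, stride 1
pool = (2, 2)  # 2 x 2 pooling, stride 2

for depth in (1, 2, 3, 4):
    plain = receptive_field([conv] * depth)
    pooled = receptive_field([conv, pool] * depth)
    print(depth, plain, pooled)
```

Without pooling the field grows by 2 pixels per layer (3, 5, 7, 9), whereas with pooling it roughly doubles at each stage (4, 10, 22, 46).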

Before we move on to the next topic, a remaining question is whether there exists 
a pooling operation which does not lose any information but increases the receptive 
field size exponentially. If there is, then it does address Geoffrey Hinton’s concern. 
Fortunately, the short answer is yes, since there exists an important advance in this 
field from the geometric understanding of deep neural networks [40, 42]. We will 
cover this issue later when we investigate the mathematical principle.

Fig. 7.10 Receptive fields of (a) a single resolution network and (b) U-Net


7.3.3 Skip Connection


Another important building block, which has been pioneered by ResNet [33] and
also by U-Net [38], is the skip connection. For example, as shown in Fig. 7.11, the
feature map output from the internal block is given by


$$y = \mathcal{F}(x) + x,$$


where $\mathcal{F}(x)$ is the output of the standard layers in the CNN with respect to the input
x, and the additional term x at the output comes directly from the input.

Thanks to the skipped branch, ResNet [33] can easily approximate the identity 
mapping, which is difficult to do using the standard CNN blocks. Later we will 
show that additional advantages of the skip connection come from removing local 
minimizers, which makes the training much more stable [35, 36]. 
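
The ease of representing the identity mapping can be seen in a toy example. In the following sketch the residual branch is a hypothetical two-layer fully connected map standing in for the standard CNN layers; the names `residual_block`, `w1`, and `w2` are assumptions of this sketch.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def residual_block(x, w1, w2):
    """y = F(x) + x, with F(x) = w2 @ relu(w1 @ x) standing in for the standard layers."""
    return w2 @ relu(w1 @ x) + x


rng = np.random.default_rng(0)
x = rng.standard_normal(4)
w1 = rng.standard_normal((4, 4))

# With the last weights of F set to zero, the block is exactly the identity mapping.
y = residual_block(x, w1, np.zeros((4, 4)))
print(np.allclose(y, x))  # True
```

Without the skipped branch, the same block would have to learn weights such that the composition of a linear map, a ReLU, and another linear map reproduces x, which is not possible in general with a single ReLU layer of the same width.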

Fig. 7.11 Skip connection in ResNet


7.4 Training CNNs 


7.4.1 Loss Functions 


When a CNN architecture is chosen, the filter kernel should be estimated. This is 
usually done during a training phase by minimizing a loss function. Specifically, 
given input data x and its label $y \in \mathbb{R}^m$, an average loss is defined by


$$c(\Theta) = E[\ell(y, f_\Theta(x))], \qquad (7.5)$$


where $E[\cdot]$ denotes the mean, $\ell(\cdot)$ is a loss function, and $f_\Theta(x)$ is a CNN with input
x, which is parameterized by the filter kernel parameter set $\Theta$. In (7.5), the mean is
usually taken empirically from training data.

For the multi-class classification problem using CNNs, one of the most widely 
used losses is the softmax loss [44]. This is a multi-class extension of the binary 
logistic regression classifier we studied before. A softmax classifier produces nor- 
malized class probabilities, and also has a probabilistic interpretation. Specifically, 
we perform the softmax transform: 


$$\hat{p}(\Theta) = \frac{e^{f_\Theta(x)}}{1^\top e^{f_\Theta(x)}}, \qquad (7.6)$$


where $e^{f_\Theta(x)}$ denotes the element-by-element application of the exponential. Then,
using the softmax loss, the average loss is computed by


$$c(\Theta) = -E\left[ \sum_{i=1}^{m} y_i \log \hat{p}_i \right], \qquad (7.7)$$


where $y_i$ and $\hat{p}_i$ denote the i-th elements of y and $\hat{p}$, respectively. If the class label
$y \in \mathbb{R}^m$ is normalized to have probabilistic meaning, i.e. $1^\top y = 1$, then (7.7) is
indeed the cross entropy between the target class distribution and the estimated class
distribution.
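
As a small numerical illustration of (7.6) and (7.7), the sketch below computes the softmax probabilities and the cross entropy for a single sample. The score vector and one-hot label are made-up values, and subtracting the maximum score before exponentiation is a common numerical-stability device that is not part of the formulas in the text.

```python
import numpy as np


def softmax(scores: np.ndarray) -> np.ndarray:
    """Softmax transform (7.6); subtracting the max leaves the result unchanged but avoids overflow."""
    shifted = scores - scores.max()
    exps = np.exp(shifted)
    return exps / exps.sum()


def cross_entropy(y: np.ndarray, p_hat: np.ndarray) -> float:
    """Per-sample version of (7.7): -sum_i y_i log p_hat_i."""
    return float(-np.sum(y * np.log(p_hat)))


scores = np.array([2.0, 1.0, 0.1])  # hypothetical network output f_Theta(x)
y = np.array([1.0, 0.0, 0.0])       # one-hot label for the first class
p_hat = softmax(scores)
print(p_hat.sum(), cross_entropy(y, p_hat))
```

For a one-hot label the loss reduces to the negative log-probability assigned to the true class; averaging over a mini-batch gives the empirical version of the expectation in (7.7).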

For the case of regression problems using CNNs, which are quite often used for 
image processing tasks such as denoising, the loss function is usually defined by the 
norm, i.e. 


$$c(\Theta) = E\|y - f_\Theta(x)\|_p^p, \qquad (7.8)$$


where p = 1 for the $l_1$ loss and p = 2 for the $l_2$ loss.


7.4.2 Data Split 


In training CNNs, available data sets should be first split into three categories: 
training, validation, and test data sets, as shown in Fig. 7.12. The training data is also 
split into mini-batches so that each mini-batch can be used for stochastic gradient 
computation. The training data set is then used to estimate the CNN filter kernels, 
and the validation set is used to monitor whether there exists any overfitting issue in 
the training. 
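
A minimal sketch of such a split is given below. The 80/10/10 proportions, the batch size of 32, and the helper names `split_indices` and `mini_batches` are illustrative assumptions, not recommendations from the text.

```python
import numpy as np


def split_indices(num_samples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle sample indices and split them into training, validation, and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_samples)
    num_val = int(num_samples * val_fraction)
    num_test = int(num_samples * test_fraction)
    val = order[:num_val]
    test = order[num_val:num_val + num_test]
    train = order[num_val + num_test:]
    return train, val, test


def mini_batches(indices, batch_size, rng):
    """Yield shuffled mini-batches of indices for one epoch."""
    shuffled = rng.permutation(indices)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


train, val, test = split_indices(1000)
batches = list(mini_batches(train, batch_size=32, rng=np.random.default_rng(1)))
print(len(train), len(val), len(test), len(batches))  # 800 100 100 25
```

The test indices are set aside and not touched during training or model selection; only the training mini-batches are used for stochastic gradient computation.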

For example, Fig. 7.13a shows the example of overfitting that can be monitored 
during the training using the validation data. If this type of overfitting happens, 
several approaches should be taken to achieve stable training behavior as shown in 
Fig. 7.13b. Such a strategy will be discussed in the following section.

Fig. 7.12 Available data split into training, validation, and test data sets

Fig. 7.13 Neural network training dynamics: (a) overfitting problems, (b) no overfitting


7.4.3 Regularization 


When we observe the overfitting behaviors similar to Fig. 7.13a, the easiest solution 
is to increase the training data set. However, in many real-world applications, the 
training data are scarce. In this case, there are several ways to regularize the neural 
network training. 


7.4.3.1 Data Augmentation 


With data augmentation, we generate artificial training instances. These are new
training instances created, for example, by applying geometric transformations such
as mirroring, flipping, and rotation to the original image in a way that does not change the
label information.
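
For a single-channel image stored as a 2-D array, such transformations can be sketched with NumPy as follows. The function name `augment` and the particular set of transformations are assumptions; whether a given transformation really preserves the label depends on the task (for example, rotating a handwritten 6 by 180 degrees does not).

```python
import numpy as np


def augment(image: np.ndarray) -> list:
    """Return label-preserving variants of a 2-D image: flips and 90-degree rotations."""
    return [
        image,
        np.fliplr(image),      # horizontal mirror
        np.flipud(image),      # vertical flip
        np.rot90(image, k=1),  # rotate by 90 degrees
        np.rot90(image, k=2),  # rotate by 180 degrees
    ]


image = np.arange(6).reshape(2, 3)
variants = augment(image)
print(len(variants), [v.shape for v in variants])
```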

Fig. 7.14 Example of dropout


7.4.3.2 Parameter Regularization 


Another way to mitigate the overfitting problem is by adding a regularization term
to the original loss. For example, we can convert the loss in (7.5) to the following
form:


$$c_{reg}(\Theta) = E[\ell(y, f_\Theta(x))] + R(\Theta), \qquad (7.9)$$


where $R(\Theta)$ is a regularization function. Recall that similar techniques were used
in the kernel machines.


7.4.3.3 Dropout 


Another regularization technique unique to deep learning is dropout [45]. The idea of
dropout is relatively simple. During training, at each iteration, a neuron is
temporarily “dropped” or disabled with probability p. This means that all the inputs and
outputs of the dropped neurons are disabled at the current iteration. The dropped-out
neurons are resampled with probability p at every training step, so a dropped-out
neuron at one step can be active at the next one. See Fig. 7.14. The reason that
dropout prevents overfitting is that during the random dropping, the input signal for
each layer varies, resulting in additional data augmentation effects.
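
The random dropping can be sketched as a mask applied to a layer's activations. The rescaling by 1 / (1 - p), often called inverted dropout, is a common convention assumed here so that the expected activation stays the same; it is not specified in the text, and the function name `dropout` is likewise an assumption of this sketch.

```python
import numpy as np


def dropout(activations, p, rng, training=True):
    """Zero each activation independently with probability p during training.

    The 1 / (1 - p) rescaling is an assumed convention that keeps the expected
    activation unchanged; at test time the activations pass through untouched.
    """
    if not training or p == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= p
    return activations * keep_mask / (1.0 - p)


rng = np.random.default_rng(0)
a = np.ones(10_000)
dropped = dropout(a, p=0.5, rng=rng)
print((dropped == 0).mean())  # fraction of dropped neurons, close to 0.5
```

Because a fresh mask is drawn at every call, a neuron dropped at one training step can be active at the next one.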


7.5 Visualizing CNNs 


As already mentioned, hierarchical features arise in the brain during visual information
processing. A similar phenomenon can be observed in a convolutional neural
network, once it is properly trained. In particular, VGGNet provides very intuitive
information that is well correlated with the visual information processing in the
brain.

For example, Fig. 7.15 illustrates the input signal that maximizes the filter
response at specific channels and layers of VGGNet [31]. Remember that the filters
are of size 3 × 3, so rather than visualizing the filters, an input image where this filter
activates the most is displayed for specific channel and layer filters. In fact, this is
similar to the Hubel and Wiesel experiments where they analyzed the input image
that maximizes the neuronal activation.

Fig. 7.15 Input images that maximize filter responses at specific channels and layers of VGGNet

Figure 7.15 shows that at the earlier layers the input signal maximizing filter 
response is composed of directional edges similar to the Hubel and Wiesel 
experiment. As we go deeper into the network, the filters build on each other and 
learn to code more complex patterns. Interestingly, the input images that maximize 
the filter response get more complicated as the depth of the layer increases. In one 
of the filter sets, we can see several objects in different orientations, as the particular 
position in the picture is not important as long as it is displayed somewhere where 
the filter is activated. Because of this, the filter tries to identify the object in multiple 
positions by encoding it in multiple places in the filter. 

The blue box in Fig. 7.15 shows the input images that maximize the
response on the last softmax level in the specific classes. In fact, this corresponds to 
the visualization of the input images that maximize the class categories. In a certain 
category, an object is displayed several times in the images. The emergence of the 
hierarchical feature from simple edges to the high-level concept is similar to visual 
information processing in the brain. 

Finally, Fig. 7.16 visualizes the feature maps at different levels of VGGNet
for a cat picture. Since the output of a convolution layer is a 3D volume, we
visualize only some of the channels. As can be seen from Fig. 7.16, the feature map
develops from edge-like features of the cat to lower-resolution information
that describes the location of the cat. In the later levels, the feature map acts as
a probability map indicating where the cat is located.

Fig. 7.16 Visualization of feature maps at several channels and layers of VGGNet when the input
image is a cat


7.6 Applications of CNNs 


CNN is the most widely used neural network architecture in the age of modern 
AI. Similar to the visual information processing in the brain, the CNN filters are 
trained in such a way that hierarchical features can be captured effectively. This can 
be one of the reasons for CNN’s success with many image classification problems, 
low-level image processing problems, and so on. 

In addition to commercial applications in unmanned vehicles, smartphones, 
commercial electronics, etc., another important application is in the field of 
medical imaging. CNN has been successfully used for disease diagnosis, image 
segmentation and registration, image reconstruction, etc. 

For example, Fig. 7.17 shows a network architecture for cancer
segmentation. Here, the label is the binary mask for cancer, and the backbone CNN
is based on the U-Net architecture, where there exists a softmax layer at the end
for pixel-wise classification. Then, the network is trained to classify the background
and the cancer regions. A very similar architecture can also be used for noise removal
in low-dose CT images, as shown in Fig. 7.18. Instead of using the softmax layer,
the network is trained with a regression loss of $l_1$ or $l_2$ using the high-quality, low-noise
images as a reference. In fact, one of the amazing and also mysterious parts of
deep learning is that a similar architecture works for different problems simply by
changing the training data.

Fig. 7.17 Cancer segmentation using U-Net

Fig. 7.18 CNN-based low-dose CT denoising

Because of this simplicity in designing and training CNNs, there are many 
exciting new startups targeting novel medical applications of AI. As the importance 
of global health care increases with the COVID-19 pandemic, medical imaging 
and general health care are undoubtedly among the most important areas of AI. 
Therefore, for the application of AI to health, opportunities are so numerous that 
we need many young, bright researchers who can invest their time and effort in AI 
research to improve human health care. 


7.7 Exercises 


1. Consider the VGGNet in Fig. 7.2. In its original implementation, the convolution 
kernel was 3 x 3. 


a. What is the total number of convolution filter sets in VGGNet? 
b. Then, what is the total number of trainable parameters in VGGNet including
convolution filters and fully connected layers? (Hint: for the fully connected
layers, the number of parameters should be input dimension × output
dimension).


2. Let your neural network code for Modified National Institute of Standards and
Technology database (MNIST) classification be denoted by $f_\Theta(x)$, where $\Theta$
represents trainable parameters and x is the input image. The last layer of your
neural network should be the softmax layer given by


$$\hat{p}(\Theta) = \frac{e^{f_\Theta(x)}}{1^\top e^{f_\Theta(x)}}, \qquad (7.10)$$


where $e^{f_\Theta(x)}$ denotes the element-by-element application of the exponential.


a. What is the meaning of the softmax layer?
b. Suppose you define the loss function for the MNIST classifier by


$$c(\Theta) = -E\left[ \sum_{i=1}^{10} y_i \log \hat{p}_i \right], \qquad (7.11)$$


where $\hat{p}_i$ denotes the i-th element of $\hat{p}$. Then, what is $\{y_i\}_{i=1}^{10}$? Provide
answers when the label has the values 1 and 5.


3. For the given U-Net architecture in Fig. 7.5, compute the effective receptive
field size. Now, suppose that there exist no pooling layers. What is the effective
receptive field size?

4. Let $u = [u[0], \cdots, u[n-1]]^\top \in \mathbb{R}^n$ and $v = [v[0], \cdots, v[n-1]]^\top \in \mathbb{R}^n$. We
define a circular convolution between the two vectors:


$$(u \circledast v)[n] = \sum_{i=0}^{n-1} u[n - i] \, v[i],$$


where the periodic boundary condition is assumed. Now, for any vector $x \in \mathbb{R}^{n_1}$
and $y \in \mathbb{R}^{n_2}$ with $n_1, n_2 \leq n$, define their circular convolution in $\mathbb{R}^n$:


$$x \circledast y = x^0 \circledast y^0,$$


where $x^0 = [x, 0_{n-n_1}]^\top$ and $y^0 = [y, 0_{n-n_2}]^\top$. Finally, for any $v \in \mathbb{R}^{n_1}$ with
$n_1 \leq n$, define the flip $\bar{v}[n] = v^0[-n]$.


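a. For an input signal $x \in \mathbb{R}^n$ and a filter $\psi \in \mathbb{R}^r$, show that


$$y = x \circledast \bar{\psi} = \mathbb{H}_r^n(x) \psi, \qquad (7.12)$$


where $\mathbb{H}_r^n(x) \in \mathbb{R}^{n \times r}$ is a wrap-around Hankel matrix:


$$\mathbb{H}_r^n(x) = \begin{bmatrix} x[0] & x[1] & \cdots & x[r-1] \\ x[1] & x[2] & \cdots & x[r] \\ \vdots & \vdots & \ddots & \vdots \\ x[n-1] & x[0] & \cdots & x[r-2] \end{bmatrix}. \qquad (7.13)$$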


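b. For an input signal $x \in \mathbb{R}^n$ and a filter $\psi \in \mathbb{R}^r$ with $r \leq n$, show the following
commutative relationship for the circular convolution in $\mathbb{R}^n$:


$$x \circledast \bar{\psi} = \mathbb{H}_r^n(x) \psi = \mathbb{H}_n^n(\bar{\psi}) x = \bar{\psi} \circledast x. \qquad (7.14)$$

c. For a given $f, u \in \mathbb{R}^n$ and $v \in \mathbb{R}^r$ with $r \leq n$, show that

$$u^\top F v = u^\top (f \circledast \bar{v}) = f^\top (u \circledast v) = \langle f, u \circledast v \rangle, \qquad (7.15)$$


where $F = \mathbb{H}_r^n(f)$.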
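d. Let the multi-input single-output (MISO) circular convolution for the p-channel
input $Z = [z_1, \cdots, z_p] \in \mathbb{R}^{n \times p}$ and the output $y \in \mathbb{R}^n$ be defined
by


$$y = \sum_{j=1}^{p} z_j \circledast \bar{\psi}^j, \qquad (7.16)$$


where $\psi^j \in \mathbb{R}^r$ denotes an r-dimensional vector and $\bar{\psi}^j \in \mathbb{R}^n$ refers to its flip.
Then, show that (7.16) can be represented in a matrix form:


$$y = Z \circledast \bar{\Psi} = \mathbb{H}_{r|p}^n(Z) \Psi, \qquad (7.17)$$


where


$$\Psi = \begin{bmatrix} \psi^1 \\ \vdots \\ \psi^p \end{bmatrix}$$


and


$$\mathbb{H}_{r|p}^n(Z) = \begin{bmatrix} \mathbb{H}_r^n(z_1) & \mathbb{H}_r^n(z_2) & \cdots & \mathbb{H}_r^n(z_p) \end{bmatrix}. \qquad (7.18)$$

e. Let the multi-input multi-output (MIMO) circular convolution for the p-channel
input $Z = [z_1, \cdots, z_p] \in \mathbb{R}^{n \times p}$ and q-channel output
$Y = [y_1, \cdots, y_q] \in \mathbb{R}^{n \times q}$ be defined by


$$y_i = \sum_{j=1}^{p} z_j \circledast \bar{\psi}_i^j, \quad i = 1, \cdots, q, \qquad (7.19)$$


where p and q are the number of input and output channels, respectively;
$\psi_i^j \in \mathbb{R}^r$ denotes an r-dimensional vector and $\bar{\psi}_i^j \in \mathbb{R}^n$ refers to its flip.
Then, show that (7.19) can be represented in a matrix form by


$$Y = \sum_{j=1}^{p} \mathbb{H}_r^n(z_j) \Psi_j = \mathbb{H}_{r|p}^n(Z) \Psi,$$


where


$$\Psi = \begin{bmatrix} \Psi_1 \\ \vdots \\ \Psi_p \end{bmatrix}, \quad \text{where} \quad \Psi_j = \begin{bmatrix} \psi_1^j & \cdots & \psi_q^j \end{bmatrix}.$$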

5. In convolutional neural networks (CNNs), a 1 × 1 convolution often follows
the convolution layer. For 1-D signals, this operation can be written as


$$y_i = \sum_{j=1}^{p} w_j \left( z_j \circledast \bar{\psi}_i^j \right), \quad i = 1, \cdots, q, \qquad (7.20)$$


where $w_j$ denotes the j-th index of the 1 × 1 convolution filter weighting. Show
that this can be represented in a matrix form by


$$Y = \sum_{j=1}^{p} w_j \mathbb{H}_r^n(z_j) \Psi_j = \mathbb{H}_{r|p}^n(Z) \Psi^w, \qquad (7.21)$$


where


$$\Psi^w = \begin{bmatrix} w_1 \Psi_1 \\ \vdots \\ w_p \Psi_p \end{bmatrix}. \qquad (7.22)$$


Chapter 8
Graph Neural Networks


8.1 Introduction 


Many important real-world data sets are available in the form of graphs or networks: 
social networks, world-wide web (WWW), protein-interaction networks, brain 
networks, molecule networks, etc. See some examples in Fig. 8.1. In fact, the
complex interaction in real systems can be described by different forms of graphs, 
so that graphs can be a ubiquitous tool for representing complex systems. 

A graph consists of nodes and edges as shown in Fig. 8.2. Although it looks 
simple, the main technical problem is that the number of nodes and edges in 
many interesting real-world problems is very large, and cannot be traced by simple 
inspection. Accordingly, people are interested in different forms of machine learning
approaches to extract useful information from graphs.

With a machine learning tool, for example, a node classification can be carried
out in which different labels are assigned to each node in a complex graph. This
could be used to classify the function of proteins in the interaction network (see 
Fig. 8.3a). Link analysis is another important problem in graph machine learning, 
which is about finding missing links between nodes. As shown in Fig. 8.3b, link 
analysis can be used for repurposing drugs for new types of pathogens or diseases. 
Yet another important goal of graph analysis is community detection. For example, 
one could identify a subnetwork that consists of disease proteins (see Fig. 8.3c). 

Despite the wide range of possible applications, neural network approaches for
graphs are not as mature as those for images, voices, etc.
This is because the processing and learning of graph data require new perspectives 
on neural networks. 

For example, as shown in Fig. 8.4, the basic assumption of convolutional neural
networks (CNNs) is that images have pixel values on regular grids, but graphs
have irregular node and edge structure so that the applications of basic modules
such as convolution, pooling, etc., are not easy. Another serious problem is that,
although CNN training data consists of images or their patches of the same size, the
training data of the graph neural network usually consists of graphs with different
numbers of nodes, network topology, and so on. For example, in graph neural
network approaches for examining the toxicity of drug candidates, the chemicals
in the training data set can have different numbers of atoms. This leads to the
fundamental question in the graph machine learning task: What do we learn from
the training data?

Fig. 8.1 Examples of graphs in real life (including the Homo sapiens hINO80 PPI network using STRING-DB.ORG)

Fig. 8.2 Nodes and edges in a graph

In fact, the main advantage of neural network approaches over other machine
learning approaches like compressed sensing [46] and low-rank matrix factorization
[47] is that the neural network approaches are inductive, which means that the
trained neural network can be applied not only to the data on which it was
originally trained, but also to data that was unseen during training.

Fig. 8.3 Several application goals of machine learning on graphs: (a) node classification, (b) link
analysis, (c) community detection

Fig. 8.4 Difference between image domain CNN and graph neural network

However, given that each graph in training data is different in its structure (for 
example, with different node and edge numbers and even topology), what kind of 
inductive information can we get from the graph neural network training? Although 
the universal approximation theorem [48] guarantees that neural networks can 
approximate any nonlinear function, it is not even clear which nonlinear function 
a graph neural network tries to approximate. 

Hence the main aim of this chapter is to answer these puzzling questions. In fact, 
we will focus on how machine learning researchers came up with brilliant ideas to 
enable inductive learning independent of different graph structures in the training 
phase. 


8.2 Mathematical Preliminaries


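Before we discuss graph neural networks, we review basic mathematical tools from
graph theory.

Fig. 8.5 Examples of graphs and their adjacency matrices. The three adjacency matrices shown in the figure are

$$\begin{bmatrix} 0&1&0&1 \\ 1&0&0&1 \\ 0&0&0&1 \\ 1&1&1&0 \end{bmatrix}, \quad \begin{bmatrix} 0&0&1&0 \\ 0&0&1&1 \\ 1&1&0&1 \\ 0&1&1&0 \end{bmatrix}, \quad \begin{bmatrix} 0&1&0&0 \\ 1&0&1&0 \\ 0&1&0&1 \\ 0&0&1&0 \end{bmatrix}$$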


8.2.1 Definition 


We denote a graph by $G = (V, E)$, with a set of vertices $V(G) = \{1, \dots, N\}$, where
$N := |V|$, and a set of edges $E(G) = \{e_{ij}\}$, where an edge $e_{ij}$ connects vertices $i$ and $j$ if
they are adjacent, i.e. neighbors. The set of neighbors of a vertex $v$ is denoted
by $\mathcal{N}(v)$. For weighted graphs, the edge $e_{ij}$ has a real value. If $G$ is an unweighted
graph, then $E$ is a sparse matrix with elements of either 0 or 1.

For a simple unweighted graph with vertex set $V$, the adjacency matrix is a square
$|V| \times |V|$ matrix $A$ such that its element $a_{uv}$ is one when there is an edge from vertex
$u$ to vertex $v$, and zero when there is no edge. See Fig. 8.5 for some examples of
adjacency matrices for undirected graphs. Note that the dimension of the adjacency
matrix varies depending on the number of nodes in the graph.
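
To make the definition concrete, the following minimal sketch builds the adjacency matrix of a simple undirected, unweighted graph from an edge list. The function name `adjacency_matrix` and the 4-node path graph are illustrative assumptions, not part of the text.

```python
def adjacency_matrix(num_nodes, edges):
    """Build the adjacency matrix of a simple undirected, unweighted graph."""
    A = [[0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        A[u][v] = 1
        A[v][u] = 1  # undirected graph: the matrix is symmetric
    return A

# Hypothetical example: the path graph 0-1-2-3 (nodes are indexed from 0 here).
path_edges = [(0, 1), (1, 2), (2, 3)]
A = adjacency_matrix(4, path_edges)
for row in A:
    print(row)
```

A graph with a different number of nodes produces a matrix of a different size, which is exactly why the adjacency matrix alone cannot serve as a fixed-size network input.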


8.2.2 Graph Isomorphism 


A graph can exist in different forms having the same number of vertices, edges,
and also the same edge connectivity. Such graphs are called isomorphic graphs.
Formally, two graphs $G$ and $H$ are said to be isomorphic if there is a one-to-one
correspondence between their vertices that preserves adjacency, so that (1) their numbers
of components (vertices and edges) are equal, and (2) their edge connections are
identical. Some examples of isomorphic graphs are shown in Fig. 8.6.

Graph isomorphism is widely used in many areas where identifying similarities 
between graphs is important. In these areas, the graph isomorphism problem is often 
referred to as the graph matching problem. Some practical uses of graph isomor- 
phism include identifying identical chemical compounds in different configurations, 
checking equivalent circuits in electronic design, etc. 

Unfortunately, testing graph isomorphism is not a trivial task. Even if the number
of nodes is the same, two isomorphic graphs can have different
adjacency matrices, since the order of the nodes in a graph can be
arbitrary, but the structure of the adjacency matrix is critically determined by
the order of the nodes. In fact, the graph isomorphism problem is one of the few
standard problems whose computational complexity remains unresolved.

Fig. 8.6 Examples of isomorphic graphs. All four graphs are isomorphic to each other
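
The dependence of the adjacency matrix on the node ordering can be checked directly. The sketch below relabels the nodes of a small graph and compares the two matrices; it also includes a brute-force isomorphism test that tries all $N!$ relabelings. The brute-force search is only an illustration for tiny graphs and is not an algorithm discussed in this chapter; the helper names are assumptions.

```python
from itertools import permutations

def permute(A, perm):
    """Relabel nodes: new node i is old node perm[i]."""
    n = len(A)
    return [[A[perm[i]][perm[j]] for j in range(n)] for i in range(n)]

def is_isomorphic(A, B):
    """Brute-force test over all n! relabelings; feasible only for tiny graphs."""
    if len(A) != len(B):
        return False
    return any(permute(A, p) == B for p in permutations(range(len(A))))

# Path graph 0-1-2-3 and the same path with relabeled nodes.
A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
B = permute(A, (2, 0, 3, 1))
# Star graph: same numbers of nodes and edges, but different connectivity.
S = [[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]

print(A == B)               # the matrices differ although the graphs are isomorphic
print(is_isomorphic(A, B))
print(is_isomorphic(A, S))
```

The factorial growth of the search space is what makes this naive approach impractical beyond a handful of nodes.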


8.2.3 Graph Coloring


A node coloring is a function $V(G) \to \Sigma$ with an arbitrary codomain $\Sigma$. A
node-colored (or simply colored) graph $(G, l)$ is then a graph $G$ endowed with a node coloring
$l : V(G) \to \Sigma$. We say that $l(v)$ is the color of $v \in V(G)$.

Figure 8.7 shows an example of graph coloring in a molecular system [49]. In
the initial phase, each node is colored with a feature vector that consists of various
chemical properties. In this case, the codomain $\Sigma$ is a subset of a real vector space
whose dimension is the number of chemical properties per node. Using machine learning
approaches, the node colors can be updated sequentially by taking into account the
color information of neighboring nodes to extract useful global properties of the
molecule.


8.3 Related Works


Since each graph in the training data has a different configuration, the main
concern of machine learning on graphs is to assign latent vectors in a common
latent space to graphs, subgraphs, or nodes so that a standard CNN, perceptron, etc.
can be applied to the latent space for inference or regression. This procedure is often
called graph embedding, as shown in Fig. 8.8. One of the most important research
topics in graph neural networks is to find an inductive rule for the graph embedding
that can be applied to graphs with different numbers of nodes, topologies, etc.

Unfortunately, one of the difficulties associated with graphs is that they are
unstructured. In fact, there is a lot of unstructured data that we encounter in everyday
life, and one of the most important classes of unstructured data is natural language.
Therefore, many graph machine learning techniques are borrowed from
natural language processing (NLP). This section therefore explains the key ideas of natural
language processing.

Fig. 8.7 Node coloring example in a molecular system. (a) Initial coloring with feature vectors,
(b) its successive update using a machine learning approach

Fig. 8.8 Concept of graph embedding to a latent vector


8.3.1 Word Embedding 


Word embedding is one of the most popular representations for natural language 
processing. Basically, it is a vector representation of a particular word that can 
capture the context of a word in a document, its semantic and syntactic similarity, 
its relationship to other words, and so on. 

For example, consider the word “king”. From its semantic meaning, one
could come to the following conclusion:

$$\text{King} - \text{Man} + \text{Woman} = \text{Queen}. \tag{8.1}$$

However, there is no mathematical operation in natural language to formally derive
(8.1). Hence, the idea of word embedding is to perform this operation through vector
operations in latent space. Specifically, let $V(\cdot)$ denote a mapping of a word
to a vector in $\mathbb{R}^d$. Then, the goal of the word embedding is to find the mapping $V$
so that

$$V(\text{King}) - V(\text{Man}) + V(\text{Woman}) = V(\text{Queen}). \tag{8.2}$$

Fig. 8.9 Example of vector operation via word embedding


This concept is illustrated in Fig. 8.9. There are several ways to embed a word. The 
main problem here is to represent each word in large text as a vector so that similar 
words are close together in latent space. 
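
As a toy illustration of (8.2), the sketch below uses hand-picked two-dimensional vectors in which the first axis loosely encodes "royalty" and the second "gender". These embeddings are hypothetical and were not produced by any trained model; the helper names `cosine` and `analogy` are also assumptions.

```python
import math

# Hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
V = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.0],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Word closest to V(a) - V(b) + V(c), excluding the three input words."""
    target = [x - y + z for x, y, z in zip(V[a], V[b], V[c])]
    candidates = [w for w in V if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(V[w], target))

print(analogy("king", "man", "woman"))
```

In a trained embedding the relation in (8.2) holds only approximately, which is why the nearest word under a similarity measure is used rather than exact equality.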

Among the various ways of performing word embedding, the so-called word2vec 
is one of the most frequently used methods [50, 51]. Word2vec is composed of 
a two-layer neural network. The network is trained in two complementary ways: 
continuous bag of words (CBOW) and skip-gram. The key idea of these approaches 
is that there are significant causal relationships and redundancies between words in 
natural languages, the information of which can be used to embed words in vector 
space. In the following, we describe them in detail. 


8.3.1.1 CBOW 


CBOW begins with the assumption that a missing word can be found from its
surrounding words in the sentence. For example, consider the sentence: The big dog
is chasing the small rabbit. The idea of CBOW is that a target word in the sentence
(which is usually the center word), for example, “dog” as shown in Fig. 8.10, can be
estimated from the nearby words within the context window (for example, using
“big” and “is” for the case of context window size $c = 1$). In general, for a
given context window size $c$, the $i$-th word $x_i$ is assumed to be estimated using
the adjacent words within a window, i.e. $\{x_j \mid j \in \mathcal{I}_c(i)\}$, as shown in Fig. 8.10,
where

$$\mathcal{I}_c(i) := \{i-c, \dots, i-1, i+1, \dots, i+c\}. \tag{8.3}$$

Fig. 8.10 Example of context and center words in CBOW

Fig. 8.11 Encoder-decoder structure of CBOW


Now, here comes the fun part. Rather than directly estimating the word
$x_i$, CBOW employs an encoder-decoder structure as depicted in Fig. 8.11. Specifically, an
encoder, represented by the shared weight $W$, converts each input $x_k$ into a corresponding
latent space vector, and then the decoder with the weight $\tilde{W}$ converts the latent
vector into the estimate of the target word $x_i$.

Furthermore, one of the most important assumptions of CBOW is that the latent
vector of the missing word is represented as the average of the latent vectors
of the adjacent words, i.e.

$$h_i = \frac{1}{2c} \sum_{k \in \mathcal{I}_c(i)} W x_k. \tag{8.4}$$

Specifically, using the $2c$ input vectors in the window and the shared encoder weight, we
generate $2c$ latent vectors, after which their average is computed. Then,
the center word is estimated by decoding the averaged latent vector with the
weight $\tilde{W}$:

$$\hat{x}_i = \tilde{W} h_i. \tag{8.5}$$


Note that other than the softmax unit in the network output, which will be explained 
later, there are no non-linearities in the hidden layer of CBOW. 

To start off, one should first build the corpus vocabulary, where each word in the
vocabulary is mapped to a unique numeric identifier $x_i$. For example, if the vocabulary size is
$M$, then $x_i$ is an $M$-dimensional one-hot encoded vector as shown in
Fig. 8.12. Once the neural network in CBOW is trained, the word embedding can be
done simply by using the encoder part of the network.

The very strict assumption that the center word may be similar to the average of
the surrounding words in the latent space works amazingly well, and CBOW
is one of the most popular classical word embedding techniques [50, 51].
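
The CBOW forward pass in (8.4) and (8.5), followed by the softmax output unit, can be sketched as follows. This is a minimal illustration under stated assumptions: the weights are random and untrained, the encoder $W$ is stored as a $d \times M$ matrix and the decoder $\tilde{W}$ as an $M \times d$ matrix, and the window is simply truncated at the sentence boundary. The names `cbow_forward`, `W_tilde`, and the latent dimension `d = 3` are hypothetical.

```python
import math
import random

sentence = "the big dog is chasing the small rabbit".split()
vocab = sorted(set(sentence))                 # corpus vocabulary
index = {w: k for k, w in enumerate(vocab)}   # word -> position of the 1 in its one-hot vector
M, d, c = len(vocab), 3, 1                    # vocabulary size, latent dimension, window size

random.seed(0)
W = [[random.uniform(-0.5, 0.5) for _ in range(M)] for _ in range(d)]        # encoder, d x M (assumed)
W_tilde = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(M)]  # decoder, M x d (assumed)

def cbow_forward(i):
    """Softmax estimate of the i-th word from the words in its context window."""
    window = [j for j in range(i - c, i + c + 1) if j != i and 0 <= j < len(sentence)]
    # For a one-hot x_k, the product W x_k is column k of W; average them as in (8.4).
    h = [sum(W[r][index[sentence[j]]] for j in window) / len(window) for r in range(d)]
    # Decode the averaged latent vector as in (8.5).
    scores = [sum(W_tilde[k][r] * h[r] for r in range(d)) for k in range(M)]
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]   # softmax is the only nonlinearity

probs = cbow_forward(sentence.index("dog"))
print({w: round(p, 3) for w, p in zip(vocab, probs)})
```

After training, only the encoder is kept: the embedding of word $k$ is column $k$ of `W` in this layout.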


8.3.1.2 Skip-Gram 


Skip-gram can be seen as a complementary idea to CBOW. The main idea behind
the skip-gram model is this: once the neural network is trained, the latent vector
generated by the focus word can predict every word in the window with high
probability. For example, Fig. 8.13 shows how we extract the focus
word and the target words within different window sizes. Here the green word is the
focus word from which the target words in the window are estimated.

Fig. 8.13 The focus and target words in skip-gram training

Similar to CBOW, the neural network training is carried out in the form of latent
vectors. In particular, the focus word encoded with a one-hot vector is converted to
a latent vector using an encoder with the weight $W$, and then the latent vector is
decoded via a parallel decoder network with the shared weight $\tilde{W}$, as shown in
Fig. 8.14. So the basic assumption of skip-gram can be written as

$$x_j \simeq \tilde{W} h_i, \quad \forall j \in \mathcal{I}_c(i), \tag{8.6}$$

where the latent vector $h_i$ is given by

$$h_i = W x_i. \tag{8.7}$$


Again, there are no non-linearities in the hidden layer of skip-gram other than the 
softmax unit in the network output. 
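
The extraction of focus and target words for skip-gram training can be sketched as follows, reusing the example sentence from the CBOW discussion. The function name `skipgram_pairs` is an assumption, and the window is truncated at the sentence boundary.

```python
def skipgram_pairs(words, c):
    """List all (focus word, target word) pairs for a context window of size c."""
    pairs = []
    for i, focus in enumerate(words):
        for j in range(max(0, i - c), min(len(words), i + c + 1)):
            if j != i:
                pairs.append((focus, words[j]))
    return pairs

sentence = "the big dog is chasing the small rabbit".split()
pairs = skipgram_pairs(sentence, c=1)
print(pairs[:3])
print(len(pairs))
```

Each pair is one training example: the focus word is the encoder input, and the target word is what the decoder output should predict, in line with (8.6).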


8.3.2 Loss Function 


The loss function for the neural network training in word2vec deserves further
discussion. Similar to the classification problem, the loss function is based on the
cross entropy between the target word and the generated word from the decoder.

Fig. 8.14 Encoder-decoder structure of skip-gram

In the case of CBOW in particular, it should be remembered that the target
vector $x_i$ is also a one-hot encoded vector. Let $t_i$ denote the nonzero index of the
vocabulary vector $x_i$. Then, the loss function of CBOW can be written in the following softmax form:

$$\ell_{\mathrm{CBOW}}(W, \tilde{W}) = -\tilde{w}_{t_i}^\top h_i + \log\left(\sum_{k=1}^{M} e^{\tilde{w}_k^\top h_i}\right), \tag{8.8}$$

where $\tilde{w}_k^\top h_i$ denotes the $k$-th entry of the decoder output for the latent vector $h_i$,
and $h_i$ is given by the average latent vector in (8.4). On the other
hand, the loss function for the skip-gram is given by

$$\ell_{\mathrm{skipgram}}(W, \tilde{W}) = -\log \prod_{j \in \mathcal{I}_c(i)} \frac{e^{\tilde{w}_{t_j}^\top h_i}}{\sum_{k=1}^{M} e^{\tilde{w}_k^\top h_i}} = -\sum_{j \in \mathcal{I}_c(i)} \tilde{w}_{t_j}^\top h_i + C \log\left(\sum_{k=1}^{M} e^{\tilde{w}_k^\top h_i}\right), \tag{8.9}$$

where $C = |\mathcal{I}_c(i)|$ is the number of target words in the window and the latent vector $h_i$ is given by (8.7).

In both approaches, the computationally intensive step is the calculation of the
denominator term, since the sum runs over all $M$ words of the vocabulary for every training sample.
One of the main research efforts is to approximate this term without sacrificing the
accuracy [50, 51].
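
Both losses share the same structure, which the following sketch makes explicit. Here `scores[k]` plays the role of the $k$-th decoder output for the latent vector, and the numerical values are hypothetical. The function name `softmax_loss` is an assumption.

```python
import math

def softmax_loss(scores, targets):
    """Return -sum_j scores[t_j] + len(targets) * log(sum_k exp(scores[k]))."""
    # The log-sum term runs over the whole vocabulary: this is the costly part.
    log_z = math.log(sum(math.exp(s) for s in scores))
    return -sum(scores[t] for t in targets) + len(targets) * log_z

# Hypothetical decoder scores for a vocabulary of size M = 4.
scores = [2.0, 0.5, -1.0, 0.1]
cbow = softmax_loss(scores, targets=[0])          # one target word, as in (8.8)
skipgram = softmax_loss(scores, targets=[0, 1])   # two target words, as in (8.9)
print(cbow, skipgram)
```

With a single target, the value is the negative log of the softmax probability of that word; with several targets, the per-target losses simply add up, which is the second equality in (8.9).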


8.4 Graph Embedding 


Similar to word embedding, graph embedding is used to convert nodes, subgraphs, 
and their features into vectors in latent space so that similar nodes, subgraphs, and 
features are close together in latent space. 

As summarized in Fig. 8.15, currently there exist three types of approaches 
for graph embedding: matrix factorization, random walks, and neural network 
approaches [52]. In the following, we first briefly review the first two approaches, 
then we discuss neural network approaches in detail. 


8.4.1 Matrix Factorization Approaches 


The main assumption of matrix factorization approaches for graph embedding is that
an adjacency matrix can be decomposed into low-rank matrices. More specifically,
for a given adjacency matrix $A \in \mathbb{R}^{N \times N}$, its low-rank matrix decomposition is to
find $U, V \in \mathbb{R}^{N \times d}$ such that

$$A \simeq U V^\top, \tag{8.10}$$

where $d$ is the latent space dimension. Then, the $i$-th node embedding in the latent
space $\mathbb{R}^d$ is given by

$$h_i = V^\top x_i \in \mathbb{R}^d,$$

where $x_i \in \mathbb{R}^N$ is again the one-hot encoded vector of the $i$-th node.

Fig. 8.15 Various approaches for graph embedding

Aside from the computational complexity of matrix decomposition, there are
several limitations in matrix factorization approaches as a graph embedding. First,
to use a matrix factorization approach, all graphs must have the same number of nodes.
Second, the approach is not inductive, but rather transductive. This means that the
learned embedding transform only works for the graph with the same adjacency
matrix, and if the connectivity changes, the embedding no longer holds.
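
A minimal sketch of the decomposition (8.10) is given below. The text does not prescribe how $U$ and $V$ are computed; plain gradient descent on the squared reconstruction error is used here purely as an illustrative assumption, and the function names, step size, and iteration count are hypothetical.

```python
import random

def factorize(A, d, steps=2000, lr=0.05, seed=0):
    """Gradient descent on ||A - U V^T||_F^2; returns U and V of shape N x d."""
    rng = random.Random(seed)
    n = len(A)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(n)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(n)]
    for _ in range(steps):
        # Residual R = A - U V^T.
        R = [[A[i][j] - sum(U[i][k] * V[j][k] for k in range(d)) for j in range(n)]
             for i in range(n)]
        # Simultaneous update: both gradients use the old U and V.
        U_new = [[U[i][k] + lr * sum(R[i][j] * V[j][k] for j in range(n)) for k in range(d)]
                 for i in range(n)]
        V_new = [[V[j][k] + lr * sum(R[i][j] * U[i][k] for i in range(n)) for k in range(d)]
                 for j in range(n)]
        U, V = U_new, V_new
    return U, V

def error(A, U, V):
    """Squared Frobenius norm of A - U V^T."""
    n, d = len(A), len(U[0])
    return sum((A[i][j] - sum(U[i][k] * V[j][k] for k in range(d))) ** 2
               for i in range(n) for j in range(n))

# Adjacency matrix of the path graph 0-1-2-3, embedded in d = 2 dimensions.
A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
U, V = factorize(A, d=2)
print(error(A, U, V))
for i, h in enumerate(V):   # h_i = V^T x_i is the i-th row of V
    print(i, [round(x, 3) for x in h])
```

The embedding is tied to this particular matrix `A`: adding a node or an edge changes `A`, and the factorization has to be recomputed, which is the transductive limitation described above.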


8.4.2 Random Walks Approaches 


Random walk approaches for graph embedding are very closely related to word
embedding, in particular, word2vec [50, 51]. Here, we review two powerful random
walk approaches: DeepWalk [53] and node2vec [54].


8.4.2.1 DeepWalk


The main intuition of DeepWalk [53] is that random walks are comparable to
sentences in the word2vec approach, so that word2vec can be used for embedding
each node of a graph. More specifically, as depicted in Fig. 8.16, the method
basically consists of three steps:


• Sampling: A graph is sampled with random walks. A few random walks of a
specific length are performed from each node.

• Training skip-gram: The skip-gram network is trained on the random walks, where each
node in a walk is encoded as a one-hot vector and serves as an input or a target.

• Node embedding: From the encoder part of the trained skip-gram, each node in
a graph is embedded into a vector in the latent space.
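
The sampling step and the preparation of skip-gram training data can be sketched as follows. The graph, the function names, and the parameter values are hypothetical; the walks are uniform random walks, and the graph is assumed to have no isolated nodes.

```python
import random

def random_walks(adj, walks_per_node, walk_length, seed=0):
    """Sampling step: fixed-length uniform random walks starting from every node."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                walk.append(rng.choice(adj[walk[-1]]))  # assumes no isolated nodes
            walks.append(walk)
    return walks

def training_pairs(walks, c):
    """Skip-gram input: (focus node, target node) pairs within a window of size c."""
    pairs = []
    for walk in walks:
        for i, focus in enumerate(walk):
            for j in range(max(0, i - c), min(len(walk), i + c + 1)):
                if j != i:
                    pairs.append((focus, walk[j]))
    return pairs

# Hypothetical graph given as neighbor lists.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
walks = random_walks(adj, walks_per_node=2, walk_length=5)
pairs = training_pairs(walks, c=1)
print(walks[0], len(pairs))
```

Each walk plays the role of a sentence and each node the role of a word, so the pairs can be fed to the same skip-gram training used for word2vec.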


Fig. 8.16 Graph node embedding using DeepWalk (Step 1: sampling random walks from the biomedical network; Step 2: training the skip-gram model; Step 3: computing embeddings in the embedding space)

Fig. 8.17 BFS and DFS random walks in node2vec


8.4.2.2 Node2vec 


Node2vec is a modification of DeepWalk with subtle but significant differences.
Node2vec is parameterized by two parameters $p$ and $q$. The parameter $p$ prioritizes
a breadth-first-search (BFS) procedure, while the parameter $q$ prioritizes a
depth-first-search (DFS) procedure. The decision of where to walk next is therefore
influenced by probabilities $1/p$ or $1/q$. As shown in Fig. 8.17, BFS is ideal
for learning local neighbors, while DFS is better for learning global structure.
Node2vec can move between the two priorities depending on the task. Other
procedures, such as the use of skip-gram, are exactly the same as in DeepWalk.
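
The role of $p$ and $q$ can be illustrated with the transition rule commonly used to describe node2vec [54]. The three-case rule below is not spelled out in this chapter and is stated here as an assumption: the unnormalized weight is $1/p$ for returning to the previous node, $1$ for moving to a common neighbor of the previous node, and $1/q$ for moving farther away. The function name and the small graph are hypothetical.

```python
def node2vec_probs(adj, prev, cur, p, q):
    """Probabilities of the next step from `cur`, having arrived from `prev`."""
    weights = {}
    for nxt in adj[cur]:
        if nxt == prev:
            weights[nxt] = 1.0 / p   # return to the previous node
        elif nxt in adj[prev]:
            weights[nxt] = 1.0       # stay close to the previous node (BFS-like)
        else:
            weights[nxt] = 1.0 / q   # move away from the previous node (DFS-like)
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}

adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
outward = node2vec_probs(adj, prev=0, cur=1, p=1.0, q=0.5)  # small q favors exploring outward
local = node2vec_probs(adj, prev=0, cur=1, p=1.0, q=2.0)    # large q keeps the walk local
print(outward)
print(local)
```

Setting $p = q = 1$ makes all weights equal, which recovers the uniform random walk of DeepWalk.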


8.4.3 Neural Network Approaches 


Recently, there has been significant progress and growing interest in graph neural
networks (GNNs), which comprise graph operations performed by deep neural networks.
Examples include spectral graph convolution approaches [55], the graph convolution
network (GCN) [56], the graph isomorphism network (GIN) [57], and graphSAGE [58], to
name just a few.

Although these approaches have been derived from different assumptions and
approximations, common GNNs typically integrate the features on each layer
in order to embed each node feature into a predefined feature vector of the
next layer. The integration process is implemented by selecting suitable functions
for aggregating features of the neighborhood nodes. Since each layer in the GNN
aggregates the 1-hop neighbors, each node feature is embedded with features in
its $k$-hop neighborhood of the graph after $k$ aggregation layers. These features are then
extracted by applying a readout function to obtain a node embedding.

Specifically, let $x_v^{(t)}$ denote the $t$-th iteration feature vector at the $v$-th node.
Then, this graph operation is generally composed of the AGGREGATE and
COMBINE functions:

$$a_v^{(t)} = \mathrm{AGGREGATE}\left(\{\{x_u^{(t-1)} : u \in \mathcal{N}(v)\}\}\right),$$

$$x_v^{(t)} = \mathrm{COMBINE}\left(x_v^{(t-1)}, a_v^{(t)}\right),$$

where the AGGREGATE function collects features of the neighborhood nodes to
extract the aggregated feature vector $a_v^{(t)}$, and the COMBINE function then combines
the previous node feature $x_v^{(t-1)}$ with the aggregated node features $a_v^{(t)}$ to output the
node feature $x_v^{(t)}$.

One of the most important considerations in designing GNNs as a graph
embedding method is that the AGGREGATE function is a function of a multiset,
denoted by $\{\{\cdot\}\}$. A multiset generalizes a set (a collection of elements where the
order is not important) by allowing elements to appear multiple times. Therefore, the
AGGREGATE function should accept collections of nodes of varying size and should
be independent of the order of the elements in them.

The importance of this condition is well illustrated in Fig. 8.18. For example, at
$t = 1$, each node has a distinct set of neighborhood nodes, so the neural network
should be applicable to all these node configurations with the shared weight.
Similar situations can happen at $t = 2$, since the nodes A and B have three and
two connecting nodes, respectively. One simple example of an AGGREGATE
function that satisfies this requirement is a sum operation:

$$a_v^{(t)} = \mathrm{AGGREGATE}\left(\{\{x_u^{(t-1)} : u \in \mathcal{N}(v)\}\}\right) = \sum_{u \in \mathcal{N}(v)} x_u^{(t-1)}. \tag{8.11}$$

Fig. 8.18 Example of aggregation function operation in a GNN


Although this sum operation is one of the most popular approaches in GNNs, we 
can consider a more general form of the operation with desirable properties. This is 
the main topic in the following section. 
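
The two requirements on the AGGREGATE function, namely independence of the ordering and applicability to neighborhoods of any size, can be checked for the sum in (8.11) with a short sketch. The feature values and the function name `aggregate_sum` are hypothetical.

```python
from itertools import permutations

def aggregate_sum(neighbor_features):
    """Sum AGGREGATE of (8.11): element-wise sum over a multiset of feature vectors."""
    return [sum(values) for values in zip(*neighbor_features)]

# A node with three neighbors (one feature vector appears twice) and a node with two.
feats_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 0.0]]
feats_b = [[3.0, 1.0], [0.5, 0.5]]

# Every ordering of the same multiset gives the same aggregate.
results = {tuple(aggregate_sum(list(order))) for order in permutations(feats_a)}
print(results)
print(aggregate_sum(feats_b))
```

By contrast, an operation such as concatenating the neighbor features would depend both on their order and on their number, so it would not be a valid function of a multiset.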


8.5 WL Test, Graph Neural Networks 


Compared to the matrix factorization and random walks approaches, the success 
of graph embedding using neural networks appears mysterious. This is because in 
order to be a valid embedding, the semantically similar input should be closely 
located in the latent space, but it is not clear whether the graph neural network 
produces such behaviors. 

For the case of matrix factorization, the embedding transform is obtained from 
the assumption that the latent vector should live in the low-dimensional subspace. 
For the case of random walks, the underlying intuition for the embedding is similar 
to that of word2vec. Therefore, these approaches are guaranteed to retain semantic 
information in the latent space. Then, how do we know that the neural-network- 
based graph embedding also conveys the semantic information? 

This understanding is particularly important because a GNN algorithm is usually
designed as an empirical algorithm and not based on a top-down principle in
order to achieve the desired embedding properties. Recently, a number of authors
[57, 59-62] have shown that the GNN is indeed a neural network implementation
of the Weisfeiler-Lehman (WL) graph isomorphism test [63]. This implies that if the
embedding vectors of a GNN are distinct from each other, then the corresponding
graphs are not isomorphic. Therefore, GNNs may retain useful semantic information
during the embedding. In this section, we review this exciting discovery in more
detail.


8.5.1  Weisfeiler-Lehman Isomorphism Test 


As discussed before, determining whether two graphs are isomorphic is a challeng- 
ing problem. It is not even known whether there is a polynomial time algorithm for 
determining whether graphs are isomorphic. 

In this sense, the Weisfeiler-Lehman (WL) algorithm [63] is a mechanism to
efficiently assign fairly unique attributes to the nodes. The core idea of the Weisfeiler-Lehman
isomorphism test is to find a signature for each node in each graph based on
the neighborhood around the node. These signatures can then be used to find the
correspondence between nodes in the two graphs. Specifically, if the signatures of
two graphs are not equivalent, then the graphs are definitely not isomorphic.

We now describe the WL algorithm formally. For a given colored graph $G$,
the WL algorithm computes a node coloring $c^{(t)} : V(G) \to \Sigma$ at each iteration $t$, depending on the coloring
from the previous iteration. To iterate the algorithm, we assign each node $v$ a tuple
that contains the old compressed label (or color) of the node, $c_v^{(t)} := c^{(t)}(v)$, and a multiset of the
compressed labels (colors) of the neighbors of the node:

$$m_v^{(t)} = \left(c_v^{(t)}, \{\{c_u^{(t)} \mid u \in \mathcal{N}(v)\}\}\right), \tag{8.12}$$

where $\{\{\cdot\}\}$ denotes the multiset, which is a set (a collection of elements where order
is not important) in which elements may appear more than once. Then, $\mathrm{HASH}(\cdot)$
bijectively assigns the above pair to a unique compressed label that was not used in
previous iterations:

$$c_v^{(t+1)} = \mathrm{HASH}\left(m_v^{(t)}\right). \tag{8.13}$$

If the number of colors does not change between two iterations, then the algorithm
ends. This procedure is illustrated in Fig. 8.19.

To test two graphs $G$ and $H$ for isomorphism, we run the above algorithm
in “parallel” on both graphs. If, at some iteration, the two graphs have different numbers of nodes
with the same color, it is concluded that the graphs are not
isomorphic. In the algorithm described above, the “compressed labels” serve as
signatures. However, it is possible that two non-isomorphic graphs have the same
signatures, so this test alone cannot provide conclusive evidence that two graphs are
isomorphic. Nevertheless, it has been shown that the WL test can be successful in the
graph isomorphism test with a high degree of probability. This is the main reason
the WL test is so important [63].

Fig. 8.19 WL algorithm for graph isomorphism test
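
The WL procedure can be sketched in a few lines. In this illustration, which rests on several assumptions, all nodes start with the same color, the HASH function is realized by a lookup table shared by both graphs, and labels are compared only within the same iteration. The function names and the example graphs are hypothetical.

```python
from collections import Counter

def refine(adj, colors, table):
    """One WL iteration: build the tuple of (8.12), then compress it as in (8.13)."""
    new_colors = {}
    for v in adj:
        m = (colors[v], tuple(sorted(colors[u] for u in adj[v])))  # sorted tuple = multiset
        new_colors[v] = table.setdefault(m, len(table))            # injective relabeling
    return new_colors

def wl_test(adj_g, adj_h, max_iters=10):
    """False: certainly not isomorphic. True: the test cannot tell the graphs apart."""
    colors_g = {v: 0 for v in adj_g}
    colors_h = {v: 0 for v in adj_h}
    for _ in range(max_iters):
        table = {}  # shared by both graphs so that equal tuples get equal labels
        new_g = refine(adj_g, colors_g, table)
        new_h = refine(adj_h, colors_h, table)
        if Counter(new_g.values()) != Counter(new_h.values()):
            return False
        stable = (len(set(new_g.values())) == len(set(colors_g.values()))
                  and len(set(new_h.values())) == len(set(colors_h.values())))
        colors_g, colors_h = new_g, new_h
        if stable:
            break
    return True

path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
relabeled_path = {0: [2, 3], 1: [3], 2: [0], 3: [0, 1]}   # the path 2-0-3-1
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
cycle6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}

print(wl_test(path, relabeled_path))
print(wl_test(path, star))
print(wl_test(cycle6, two_triangles))
```

The last pair shows the limitation mentioned above: a 6-cycle and two disjoint triangles are not isomorphic, yet every node has exactly two neighbors in both graphs, so the signatures coincide.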


8.5.2 Graph Neural Network as WL Test 


Recall that a GNN computes a sequence $\{x_v^{(t)}\}_{v \in V}$ for $t \geq 0$ of vector embeddings
of a graph $G = (V, E)$. In the most general form, the embedding is recursively
computed as

$$a_v^{(t)} = \mathrm{AGGREGATE}\left(\{\{x_u^{(t-1)} : u \in \mathcal{N}(v)\}\}\right), \tag{8.14}$$

where $\{\{\cdot\}\}$ is the multiset and the aggregation function is symmetric in its arguments,
and the updated feature vector is given by

$$x_v^{(t)} = \mathrm{COMBINE}\left(x_v^{(t-1)}, a_v^{(t)}\right). \tag{8.15}$$

From (8.14) and (8.15) in comparison with (8.12) and (8.13), if we identify $x_v^{(t)}$
as the coloring at the $t$-th iteration, i.e. $c_v^{(t)}$, then we can see that there are
remarkable similarities between GNN updates and the WL algorithm in terms of
their arguments, which are made up of the multiset of neighbors and the previous
node feature. In fact, these are not incidental findings; there is a fundamental equivalence
between them.

For example, in graph convolutional neural networks (GCNs) [56] and
graphSAGE [58], the AGGREGATE function is given by an average operation, whereas
it is just a simple sum in the graph isomorphism network (GIN) [57]. One could use
the element-by-element max operation as the AGGREGATE function, or even a
long short-term memory (LSTM) [58]. Similarly, a simple sum followed
by a multilayer perceptron (MLP) can be used as the COMBINE function, or a
weighted sum or concatenation followed by an MLP could be used [58, 59]. In
general, the GNN operation can be represented by

$$x_v^{(t)} = \sigma\left(W_1^{(t)} x_v^{(t-1)} + \sum_{u \in \mathcal{N}(v)} W_2^{(t)} x_u^{(t-1)}\right), \tag{8.16}$$

for some matrices $W_1^{(t)}, W_2^{(t)}$ and the nonlinearity $\sigma(\cdot)$ [59]. One of the important
discoveries in [59] is that for a given coloring $\{c_v^{(t)}\}_{v \in V}$, there always exist
matrices $W_1^{(t)}$ and $W_2^{(t)}$ which make the update (8.16) equivalent to the WL
algorithm in (8.12) and (8.13). Therefore, the GNN is indeed a neural network
implementation of the WL algorithm for the graph isomorphism test, and the way
GNNs produce node embeddings is to map the graph to a signature that can be used
to test graph matching.
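
One update of the form (8.16) can be sketched as follows, with ReLU as the nonlinearity. The weight matrices and the input features are hypothetical values chosen by hand, not learned parameters, and the function name `gnn_layer` is an assumption. Since $W_2^{(t)}$ is shared among the neighbors, the sum over neighbors is computed first and then multiplied by $W_2^{(t)}$.

```python
def gnn_layer(adj, X, W1, W2):
    """One update of (8.16) with sigma = ReLU and sum aggregation over neighbors."""
    dim = len(W1[0])
    new_X = {}
    for v in adj:
        # Sum of neighbor features; an empty neighborhood gives the zero vector.
        agg = [sum(X[u][k] for u in adj[v]) for k in range(dim)]
        out = [sum(W1[r][k] * X[v][k] + W2[r][k] * agg[k] for k in range(dim))
               for r in range(len(W1))]
        new_X[v] = [max(0.0, o) for o in out]
    return new_X

# Star graph with identical input features on every node.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
X0 = {v: [1.0, 2.0] for v in star}
W1 = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical weights
W2 = [[0.5, 0.0], [0.0, 0.5]]
X1 = gnn_layer(star, X0, W1, W2)
print(X1)
```

As in the first WL iteration with uniform initial colors, the three leaves receive the same feature vector while the center receives a different one, so the update separates nodes by their neighborhood structure.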


8.6 Summary and Outlook 


So far we have discussed the graph neural network approach as a modern method
of performing graph embedding. The most important finding is that the GNN is
actually a neural network implementation of the WL test. Therefore, the GNN fulfills
an important property of embedding: if two feature vectors in latent space are
different, the underlying graphs are different.

The embedding of graphs with GNNs is by no means complete. In order to get
a really meaningful graph embedding, the vector operation in latent space should
have the same semantic meaning as in the original graph, similar to that of word
embedding. However, it is still not clear whether the current GNN-based embedding
of graphs can lead to such versatile properties.

Hence, the field of graph neural networks is still a wide open area of research,
and the next level of breakthroughs will require many good ideas from young and
enthusiastic researchers.


8.7 Exercises


1. Show that every connected graph with $n$ vertices has at least $n - 1$ edges.

2. For the case of CBOW, recall that the target vector $x_i$ is also a one-hot encoded
vector. Let $t_i$ denote the nonzero index of the vocabulary vector $x_i$. Then, show
that the loss function of CBOW can be written in the following softmax form:

$$\ell_{\mathrm{CBOW}}(W, \tilde{W}) = -\tilde{w}_{t_i}^\top h_i + \log\left(\sum_{k=1}^{M} e^{\tilde{w}_k^\top h_i}\right), \tag{8.17}$$

where the latent vector $h_i$ is given by the average latent vector.

3. Classify, up to isomorphism, all connected graphs (simple or not simple) with 5 
vertices and 5 edges. You may find that every simple, connected graph with 5 
vertices and 5 edges is isomorphic to exactly one of the five cases. 

4. Let $G$ be a graph with 4 connected components and 20 edges. What is the
maximum possible number of vertices in $G$?

5. The GIN was proposed as a special case of spatial GNNs suitable for graph
classification tasks. The network implements the aggregate and combine functions
as the sum of the node features:

$$x_v^{(k)} = \mathrm{MLP}^{(k)}\left(\left(1 + \epsilon^{(k)}\right) x_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} x_u^{(k-1)}\right), \tag{8.18}$$

where $\epsilon^{(k)} = 0.1$, and MLP is a multilayer perceptron with ReLU nonlinearity.

a. Draw the corresponding graph, whose adjacency matrix is given by

$$A = \begin{bmatrix} 0&1&1&0 \\ 1&0&1&1 \\ 1&1&0&0 \\ 0&1&0&0 \end{bmatrix}.$$

b. Suppose that the input node feature is a one-hot feature matrix:

$$X^{(0)} = \begin{bmatrix} 1&0&0&0 \\ 0&1&0&0 \\ 0&0&1&0 \\ 0&0&0&1 \end{bmatrix},$$

and the MLP weight matrix, which is the same for both layers, is given by

$$W^{(k)} = \begin{bmatrix} 0.1&-0.2&-0.3&0.4 \\ -0.1&0.2&-0.3&0.4 \\ 0.4&0.3&0.2&-0.1 \\ -0.4&0.3&0.2&-0.1 \end{bmatrix}.$$

Then, obtain the next layer feature matrices $X^{(1)}$ and $X^{(2)}$, assuming that there
is no bias in each MLP.


Chapter 9
Normalization and Attention


9.1 Introduction 


In this chapter, we will discuss very exciting and rapidly evolving technical fields of 
deep learning: normalization and attention. 

Normalization originated from the batch normalization technique [41] that 
accelerates the convergence of stochastic gradient methods by reducing the covariate 
shift. The idea has been extended further to various forms of normalization, such 
as layer norm [64], instance norm [65], group norm [66], etc. In addition to the 
original use of normalization for better convergence of stochastic gradients, adaptive 
instance normalization (AdaIN) [67] is another example where the normalization 
technique can be used as a simple but powerful tool for style transfer and generative 
models. 

On the other hand, attention has been introduced to computer vision applications
based on the intuition that we “attend to” a particular part when processing a large
amount of information [68-72]. Attention has played the key role in the recent
breakthroughs in natural language processing (NLP), such as Transformer [73],
Google’s Bidirectional Encoder Representations from Transformers (BERT) [74],
OpenAI’s Generative Pre-trained Transformer (GPT)-2 [75] and GPT-3 [76], etc.

For beginners, the normalization and attention mechanisms look very heuristic 
without any clue for systematic understanding, which is even more confusing due 
to their similarities. In addition, understanding AdaIN, Transformer, BERT, and 
GPT is like reading recipes the researchers developed with their own secret sauces. 
However, an in-depth study reveals a very nice mathematical structure behind their 
intuition. 

In this chapter, we first review classical and current state-of-the-art normalization
and attention techniques, and then discuss their specific realization in various
deep learning architectures, such as style transfer [77-83], multi-domain image
transfer [84-87], generative adversarial network (GAN) [71, 88, 89], Transformer
[73], BERT [74], and GPT [75, 76]. Then, we conclude by providing a unified
mathematical view to understand both normalization and attention.


9.1.1 Notation 


In deep neural networks, a feature map is defined as a filter output at each layer. For 
example, feature maps from VGGNet are shown in Fig. 9.1, where the input image 
is a cat. Since there exist multiple channels at each layer, the feature map is indeed 
a 3D volume. Moreover, during the training, multiple 3D feature maps are obtained 
from a mini-batch. 


Fig. 9.1 Examples of feature maps on one channel of each layer of VGGNet

To make the notation simple for mathematical analysis, in this chapter a feature
map for each channel is vectorized. Moreover, we often ignore the layer-dependent
indices in the features. Specifically, the feature map on a layer is represented by


$$
X = \begin{bmatrix} x_1 & \cdots & x_C \end{bmatrix} \in \mathbb{R}^{HW \times C}, \tag{9.1}
$$


where $x_c \in \mathbb{R}^{HW \times 1}$ refers to the c-th column vector of X, which is the
vectorized feature map of size H × W at the c-th channel. We often use N := HW
to denote the number of pixels. Equation (9.1) is often represented with row vectors
to explicitly show the row dependency:


$$
X = \begin{bmatrix} x^1 \\ \vdots \\ x^{HW} \end{bmatrix} \in \mathbb{R}^{HW \times C}, \tag{9.2}
$$


where $x^i \in \mathbb{R}^{1 \times C}$ refers to the i-th row vector, representing the
channel-dimensional feature at the i-th pixel location.


9.2 Normalization


The basic idea of normalization is to normalize the input/feature layer by recentering
and rescaling, although specific details differ depending on the algorithm. Perhaps the
most influential paper, which opened up the research field of normalization, is the one
on batch normalization [41], as reflected in its about 25k citations as of February
2021. Thus, we first review batch normalization, and then discuss how
it has evolved into different forms of normalization techniques.


9.2.1 Batch Normalization 


Batch normalization was originally proposed to reduce the internal covariate shift 
and improve the speed, performance, and stability of artificial neural networks. 
During the training phase of the networks, the distribution of the input on the 
current layer changes accordingly if the distribution of the feature on the previous 
layers changes, so that the current layer has to be constantly adapted to new 
distributions. This problem is particularly severe for deep networks because small 
changes in shallower hidden layers are amplified as they propagate through the 
network, causing a significant shift in deeper hidden layers. The method of batch
normalization is therefore proposed to reduce these undesirable shifts by recentering
and scaling.

Specifically, the batch normalization is carried out by the following transform:


$$
y_c = \frac{\gamma_c}{\sigma_c}\,(x_c - \mu_c \mathbf{1}) + \beta_c \mathbf{1}, \tag{9.3}
$$


for all $c = 1, \cdots, C$, where $\mathbf{1} \in \mathbb{R}^{HW}$ denotes the vector of ones,
$\gamma_c$ and $\beta_c$ are trainable parameters for the c-th channel, and $\mu_c$ and
$\sigma_c$ are the mini-batch statistics defined by


$$
\mu_c = \frac{1}{HW}\,\mathbf{1}^\top E[x_c], \tag{9.4}
$$

$$
\sigma_c = \sqrt{\frac{1}{HW}\,E\left[\|x_c - \mu_c \mathbf{1}\|^2\right]}, \tag{9.5}
$$


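where the expectation E[·] is taken over the mini-batch. In matrix form, (9.3) can be
represented by


$$
Y = XT + B, \tag{9.6}
$$

where

$$
T = \begin{bmatrix} \frac{\gamma_1}{\sigma_1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \frac{\gamma_C}{\sigma_C} \end{bmatrix} \in \mathbb{R}^{C \times C}, \quad
B = \begin{bmatrix} \mathbf{1} & \cdots & \mathbf{1} \end{bmatrix}
\begin{bmatrix} \beta_1 - \frac{\gamma_1 \mu_1}{\sigma_1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \beta_C - \frac{\gamma_C \mu_C}{\sigma_C} \end{bmatrix}. \tag{9.7}
$$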
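The transform in (9.3)-(9.7) can be sketched in NumPy as follows. This is only a minimal illustration under stated assumptions: the mini-batch is stored as an array of shape (B, N, C) with N = HW, the function names are illustrative, and the small constant `eps` is added for numerical stability although it does not appear in (9.5).

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    """Batch normalization (9.3) for a mini-batch X of shape (B, N, C).

    gamma, beta: per-channel trainable parameters of shape (C,).
    eps: stabilizing constant (an assumption, not part of (9.5)).
    """
    # (9.4): per-channel mean over the mini-batch and all pixels
    mu = X.mean(axis=(0, 1))
    # (9.5): per-channel standard deviation over the same entries
    sigma = np.sqrt(((X - mu) ** 2).mean(axis=(0, 1)) + eps)
    return gamma / sigma * (X - mu) + beta

def batch_norm_matrix_form(X, gamma, beta, eps=1e-5):
    """The same transform written as Y = X T + B, cf. (9.6) and (9.7)."""
    mu = X.mean(axis=(0, 1))
    sigma = np.sqrt(((X - mu) ** 2).mean(axis=(0, 1)) + eps)
    T = np.diag(gamma / sigma)                       # (C, C) diagonal scaling
    ones = np.ones((X.shape[1], X.shape[2]))         # [1 ... 1], shape (N, C)
    B = ones @ np.diag(beta - gamma * mu / sigma)    # (N, C) bias term
    return X @ T + B                                 # broadcast over the batch axis
```

Both functions return the same array; the second one only makes explicit that, once the statistics are fixed, batch normalization is an affine map acting on the channel dimension.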


In addition to reducing the internal covariate shift, it is believed that batch 
normalization has many other advantages. With this additional operation, the 
network can use a higher learning rate without gradients vanishing or exploding. 
In addition, the batch normalization appears to have a regularization effect so that 
the network improves its generalization properties and therefore there is no need 
to use dropout to reduce overfitting. It has also been observed that with the batch 
normalization, the network becomes more robust towards different initialization 
schemes and learning rates. 

For example, Fig. 9.2 shows that the batch norm (BN) layer is used within
the structure of DenseNet [37] to improve the learning rate of the ImageNet
classification task. Similarly, a powerful CNN image denoiser was proposed in [90]
by just cascading BN layer, ReLU, and filter layers as shown in Fig. 9.3.

Fig. 9.2 Batch norm layer in DenseNet (legend: BN-ReLU-Conv, Transition Layer)

Fig. 9.3 The use of batch norm in CNN denoiser (labels: Noisy Image, Residual Image)


9.2.2 Layer and Instance Normalization 


Batch normalization is a powerful tool, but not without its limitations. The main 
limitation of batch normalization is that it depends on the mini-batch when 
calculating (9.4) and (9.5). Then, how can we mitigate the problem of batch 
normalization? 

To answer this question, let us look into the volume of the feature maps that
are stacked along the mini-batch in Fig. 9.4. The left column of Fig. 9.4 shows the
normalization operation in batch norm, whereby the shaded area is used to calculate
the mean and standard deviation for centering and rescaling. Here, B denotes the
size of the mini-batch.

Fig. 9.4 Various forms of feature normalization methods (panels: Batch Normalization,
Layer Normalization, Instance Normalization). B: batch size, C: number of channels,
and H, W: height and width of the feature maps


In fact, the picture of batch norm shows that there are several normalization 
options. For example, the layer normalization [64] computes the mean and standard 
deviation along the channel and image direction without considering the mini-batch. 
More specifically, we have 


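$$
y_c = \frac{\gamma}{\sigma}\,(x_c - \mu \mathbf{1}) + \beta \mathbf{1}, \tag{9.8}
$$


for all $c = 1, \cdots, C$. Here, $\gamma$ and $\beta$ are channel-independent trainable parameters,
while $\mu$ and $\sigma$ are computed by


$$
\mu = \frac{1}{HWC} \sum_{c=1}^{C} \mathbf{1}^\top x_c, \tag{9.9}
$$

$$
\sigma = \sqrt{\frac{1}{HWC} \sum_{c=1}^{C} \|x_c - \mu \mathbf{1}\|^2}. \tag{9.10}
$$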


In the layer normalization, each sample within the mini-batch has a different normal- 
ization operation, allowing arbitrary mini-batch sizes to be used. The experimental 
results show that layer normalization performs well for recurrent neural networks 
[64]. 

On the other hand, the instance normalization normalizes the feature data for each 
sample and channel as shown on the right-hand side of Fig. 9.4. More specifically, 
we have 


$$
y_c = \frac{\gamma_c}{\sigma_c}\,(x_c - \mu_c \mathbf{1}) + \beta_c \mathbf{1}, \tag{9.11}
$$


for all $c = 1, \cdots, C$, where

$$
\mu_c = \frac{1}{HW}\,\mathbf{1}^\top x_c, \tag{9.12}
$$

$$
\sigma_c = \sqrt{\frac{1}{HW}\,\|x_c - \mu_c \mathbf{1}\|^2}, \tag{9.13}
$$


whereas $\gamma_c$ and $\beta_c$ are trainable parameters for the channel c. In matrix form, (9.11)
can be represented by


$$
Y = XT + B, \tag{9.14}
$$


where T and B are similar to (9.7) but calculated for each sample.
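The three normalization schemes differ only in the set of entries over which the mean and standard deviation are computed. The following NumPy sketch makes this explicit, assuming a mini-batch stored as an array of shape (B, N, C) with N = HW. The function names and the constant `eps` are illustrative assumptions, not part of the equations above.

```python
import numpy as np

def normalization_stats(X, mode):
    """Mean and standard deviation of X with shape (B, N, C), N = HW.

    mode = "batch":    per channel, over the mini-batch and pixels, cf. (9.4)-(9.5)
    mode = "layer":    per sample, over pixels and channels, cf. (9.9)-(9.10)
    mode = "instance": per sample and channel, over pixels, cf. (9.12)-(9.13)
    """
    axes = {"batch": (0, 1), "layer": (1, 2), "instance": (1,)}[mode]
    mu = X.mean(axis=axes, keepdims=True)
    sigma = X.std(axis=axes, keepdims=True)   # population standard deviation
    return mu, sigma

def normalize(X, mode, gamma=1.0, beta=0.0, eps=1e-5):
    """Recenter and rescale X; gamma and beta may be scalars or arrays of shape (C,)."""
    mu, sigma = normalization_stats(X, mode)
    return gamma * (X - mu) / (sigma + eps) + beta
```

With `mode="layer"` or `mode="instance"`, the output for one sample does not change when other samples are added to or removed from the mini-batch, which is the property that removes the dependency on the mini-batch size.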


9.2.3 Adaptive Instance Normalization (AdaIN) 


With AdaIN [67], a new chapter of normalization method has opened, which 
goes beyond the classical normalization methods that were designed to improve 
the performance and reduce the dependency on learning rate. The most important 
finding of AdaIN is that the instance normalization transformation in (9.11) provides 
an important hint for the style transfer. 

Before we discuss the details of AdaIN, we first explain the concept of image 
style transfer. Figure 9.5 shows an example of image style transfer using AdaIN 
[67]. Here, the top row shows the content images associated with the content feature
$X = [x_1, \cdots, x_C]$, while the left-most column corresponds to style images that are
associated with the style feature $S = [s_1, \cdots, s_C]$. The aim of the image style
transfer is then to convert the content images into a stylized image that is guided by
a certain style image. How does AdaIN manage the style transfer in this context? 

The main idea is to use the instance normalization in (9.11), but instead of using
$\gamma_c$ and $\beta_c$ that are calculated by its own feature, these values are calculated as the
standard deviation and the mean value of the style image, i.e.


$$
\beta_c^s = \frac{1}{HW}\,\mathbf{1}^\top s_c, \tag{9.15}
$$

$$
\gamma_c^s = \sqrt{\frac{1}{HW}\,\|s_c - \beta_c^s \mathbf{1}\|^2}, \tag{9.16}
$$


where $s_c$ is the c-th channel feature map from the style image. In matrix form,
AdaIN can be represented by


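$$
Y = X T_x T_s + B_{x,s}, \tag{9.17}
$$


where $T_x$ and $T_s$ are diagonal matrices computed from X and S, respectively:


$$
T_x = \begin{bmatrix} \frac{1}{\sigma_1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \frac{1}{\sigma_C} \end{bmatrix} \in \mathbb{R}^{C \times C}, \tag{9.18}
$$

$$
T_s = \begin{bmatrix} \gamma_1^s & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \gamma_C^s \end{bmatrix} \in \mathbb{R}^{C \times C}, \tag{9.19}
$$


whereas $B_{x,s}$ is the bias term computed using both X and S:


$$
B_{x,s} = \begin{bmatrix} \mathbf{1} & \cdots & \mathbf{1} \end{bmatrix}
\begin{bmatrix} \beta_1^s - \frac{\gamma_1^s \mu_1}{\sigma_1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \beta_C^s - \frac{\gamma_C^s \mu_C}{\sigma_C} \end{bmatrix}. \tag{9.20}
$$

Fig. 9.5 Examples of image style transfer using AdaIN [67]

Fig. 9.6 The network architecture of AdaIN style transfer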


The generation of the style feature map can be done with the same encoder, as 
shown in Fig. 9.6, whereby both content and style images are given as inputs for the 
VGG encoder for feature vector extraction, from which the AdaIN layer changes 
the style using the AdaIN operation described above. 
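As a rough illustration of the AdaIN operation in (9.15)-(9.20), the following NumPy sketch replaces the channel-wise statistics of a content feature map by those of a style feature map. It assumes that both feature maps are already vectorized into arrays of shape (number of pixels, C); the function names and the constant `eps` are assumptions made for this example, and the encoder and decoder of Fig. 9.6 are not modeled.

```python
import numpy as np

def adain(X, S, eps=1e-5):
    """AdaIN for content features X (N_x, C) and style features S (N_s, C)."""
    mu_x = X.mean(axis=0)              # content mean per channel, cf. (9.12)
    sigma_x = X.std(axis=0) + eps      # content std per channel, cf. (9.13)
    beta_s = S.mean(axis=0)            # style mean, (9.15)
    gamma_s = S.std(axis=0)            # style std, (9.16)
    return gamma_s * (X - mu_x) / sigma_x + beta_s

def adain_matrix_form(X, S, eps=1e-5):
    """The same operation written as Y = X T_x T_s + B_{x,s}, cf. (9.17)."""
    mu_x = X.mean(axis=0)
    sigma_x = X.std(axis=0) + eps
    beta_s = S.mean(axis=0)
    gamma_s = S.std(axis=0)
    T_x = np.diag(1.0 / sigma_x)                                      # (9.18)
    T_s = np.diag(gamma_s)                                            # (9.19)
    B = np.ones_like(X) @ np.diag(beta_s - gamma_s * mu_x / sigma_x)  # (9.20)
    return X @ T_x @ T_s + B
```

After the operation, each channel of the output has (up to `eps`) the mean and standard deviation of the corresponding style channel, while the spatial pattern within the channel still comes from the content feature.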


9.2.4 Whitening and Coloring Transform (WCT) 


The whitening and coloring transform (WCT) is another powerful method of image 
style transfer [79], which is composed of a whitening transform followed by a 
coloring transform. Mathematically, this can be written as


$$
Y = X T_x T_s + B_{x,s}, \tag{9.21}
$$


where $B_{x,s}$ is the same as (9.20), and the whitening transform $T_x$ and the coloring
transform $T_s$ are computed from X and S, respectively:


$$
T_x = U_x \Sigma_x^{-\frac{1}{2}} U_x^\top, \quad T_s = U_s \Sigma_s^{\frac{1}{2}} U_s^\top, \tag{9.22}
$$


where $U_x, \Sigma_x$ and $U_s, \Sigma_s$ are from the eigen-decomposition of the covariance
matrices of X and S:


$$
X^\top X = U_x \Sigma_x U_x^\top, \quad S^\top S = U_s \Sigma_s U_s^\top. \tag{9.23}
$$


Therefore, we can easily see that AdaIN is a special case of WCT, when the 
covariance matrix is diagonal. 
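The whitening and coloring steps in (9.22) and (9.23) can be sketched with an eigen-decomposition in NumPy. The sketch below makes several assumptions that are not spelled out in the text: the features are centered before the Gram matrices are formed, the style mean is added back at the end (a common choice for WCT), and eigenvalues are clipped at a small `eps` to keep the inverse square root finite. The helper name `_sym_power` is hypothetical.

```python
import numpy as np

def _sym_power(M, p, eps=1e-8):
    """M**p for a symmetric positive semi-definite matrix M."""
    evals, U = np.linalg.eigh(M)            # M = U diag(evals) U^T
    evals = np.clip(evals, eps, None)       # guard against tiny eigenvalues
    return (U * evals ** p) @ U.T           # U diag(evals**p) U^T

def wct(X, S):
    """Whitening and coloring of content X (N_x, C) with style S (N_s, C)."""
    Xc = X - X.mean(axis=0)                 # centered content features
    Sc = S - S.mean(axis=0)                 # centered style features
    T_x = _sym_power(Xc.T @ Xc, -0.5)       # whitening transform, (9.22)
    T_s = _sym_power(Sc.T @ Sc, 0.5)        # coloring transform, (9.22)
    return Xc @ T_x @ T_s + S.mean(axis=0)
```

If the two Gram matrices happen to be diagonal, `T_x` and `T_s` reduce to channel-wise scalings, which corresponds to the AdaIN special case mentioned above.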


9.3 Attention 


In cognitive neuroscience, attention is defined as the behavioral and cognitive 
process in which one selectively focuses on one aspect of information and ignores 
other perceptible information. In this section we describe a biological analogy of 
attention at the neuronal level and discuss its mathematical formulation. 


9.3.1 Metabotropic Receptors: Biological Analogy 


It is known that there are two types of neurotransmitter receptors: ionotropic and 
metabotropic receptors [91]. Ionotropic receptors are transmembrane molecules 
that can “open” or “close” a channel so that different types of ions can migrate 
in and out of the cell, as shown in Fig. 9.7a. On the other hand, the activation of 
the metabotropic receptors only indirectly influences the opening and closing of ion 
channels. In particular, a receptor activates the G-protein as soon as a ligand binds to 
the metabotropic receptor. Once activated, the G-protein itself goes on and activates 
another molecule called a “secondary messenger”. The secondary messenger moves
until it binds to ion channels, located at different points on the membrane, and opens 
them (see Fig. 9.7b). It is important to remember that metabotropic receptors do not 
have ion channels and the binding of a ligand may or may not lead to the opening 
of ion channels at different locations on the membrane. 

Mathematically, this process can be modeled as follows. Let $x_n$ be the number
of neurotransmitters that bind to the n-th synapse. G-proteins generated at the n-th
synapse are proportional to the sensitivity of the metabotropic receptor, which is
denoted by $k_n$. Then, the G-proteins generate the secondary messengers that bind to
the ion channel at the m-th synapse with the sensitivity of $q_m$. Since the secondary
messengers are generated from metabotropic receptors at various synapses, the total
amount of ion influx from the m-th synapse is determined by the sum given by


$$
y_m = q_m \sum_{n=1}^{N} k_n x_n, \quad m = 1, \cdots, N, \tag{9.24}
$$


which can be represented by a vector form


$$
y = Tx, \quad \text{where} \quad T := q k^\top. \tag{9.25}
$$

Fig. 9.7 Two different types of neurotransmitter receptors and their mechanisms


Note that the matrix T in (9.25) is a transform matrix from x to y. Indeed, the
transform matrix T is a rank-1 matrix. Accordingly, the output y is constrained to
live in the linear subspace of the column vector, i.e. R(q), where R(·) denotes the
range space. This implies that the activation patterns in the neuron follow the ion
channel sensitivity patterns, q, while their magnitude is modulated by k.
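The rank-1 structure of (9.25) is easy to check numerically. In the short NumPy sketch below, the vectors are filled with random positive numbers purely for illustration; they are not meant to represent physiological values.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
q = rng.random(N)        # ion channel sensitivities (query)
k = rng.random(N)        # metabotropic receptor sensitivities (key)
x = rng.random(N)        # amount of neurotransmitter at each synapse

T = np.outer(q, k)       # T = q k^T as in (9.25)
y = T @ x                # y_m = q_m * sum_n k_n x_n as in (9.24)

# T has rank 1, so y is always a scalar multiple of q: y = (k^T x) q
rank_of_T = np.linalg.matrix_rank(T)
scale = k @ x
```

Whatever input x is chosen, the output pattern y stays on the line spanned by q, and only its magnitude `scale` depends on k and x.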

This could explain another role for the metabotropic receptors. In particular, 
metabotropic receptors act more for their prolonged activation than for a short- 
term activation as in the case of ionotropic receptors, since the activation pattern 
is determined by the ion channel distributions to which the secondary messengers 
bind rather than by the specific location at which the original neurotransmitter is 
released. Thus, the synergistic combination of the q and k determines the general 
behavior of neuronal activation. 


9.3.2 Mathematical Modeling of Spatial Attention 


In (9.25), the vectors q and k are often referred to as query and key. It is
remarkable that even with the same key k, a totally different activation pattern can 
be obtained by changing the query vector q. In fact, this is the core idea of the 
attention mechanism. By decoupling the query and key, we can dynamically adapt 
the neuronal activation patterns for our purpose. In the following, we review the 
general form of the attention developed based on this concept. 

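In artificial neural networks, the model (9.24) is generalized for vector quantities.
Specifically, the row vector output at the m-th pixel $y^m \in \mathbb{R}^C$ is determined by the
vector version of query $q^m \in \mathbb{R}^d$, keys $k^n \in \mathbb{R}^d$, and values $x^n \in \mathbb{R}^C$:


$$
y^m = \sum_{n=1}^{N} a_{mn} x^n, \tag{9.26}
$$

where $m = 1, \cdots, N$ and

$$
a_{mn} := \frac{\exp\left(\mathrm{score}(q^m, k^n)\right)}{\sum_{n'=1}^{N} \exp\left(\mathrm{score}(q^m, k^{n'})\right)}. \tag{9.27}
$$


Here, score(·, ·) determines the similarity between the two vectors. In matrix form,
(9.26) can be represented by


$$
Y = AX, \tag{9.28}
$$

where

$$
X = \begin{bmatrix} x^1 \\ \vdots \\ x^N \end{bmatrix}, \quad Y = \begin{bmatrix} y^1 \\ \vdots \\ y^N \end{bmatrix}, \tag{9.29}
$$

and

$$
A = \begin{bmatrix} a_{11} & \cdots & a_{1N} \\ \vdots & \ddots & \vdots \\ a_{N1} & \cdots & a_{NN} \end{bmatrix}. \tag{9.30}
$$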


Various forms of the score functions are used for attention: 


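• Dot product: $\mathrm{score}(q^m, k^n) := \langle q^m, k^n \rangle$.
• Scaled dot product: $\mathrm{score}(q^m, k^n) := \langle q^m, k^n \rangle / \sqrt{d}$.
• Cosine similarity: $\mathrm{score}(q^m, k^n) := \frac{\langle q^m, k^n \rangle}{\|q^m\| \, \|k^n\|}$.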
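For concreteness, the three score functions listed above can be written as one-line NumPy functions acting on a single query vector and a single key vector of the same length d. The function names are illustrative.

```python
import numpy as np

def dot_score(q, k):
    """Dot product score."""
    return q @ k

def scaled_dot_score(q, k):
    """Dot product divided by the square root of the vector dimension d."""
    return q @ k / np.sqrt(q.shape[-1])

def cosine_score(q, k):
    """Cosine of the angle between q and k."""
    return q @ k / (np.linalg.norm(q) * np.linalg.norm(k))
```

The cosine similarity is bounded between -1 and 1 regardless of the vector norms, whereas the two dot-product scores grow with the magnitudes of q and k.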


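For example, in dot product attention, the query and key vectors are usually
generated using linear embeddings. More specifically,


$$
q^n = x^n W_Q, \quad k^n = x^n W_K, \quad n = 1, \cdots, N, \tag{9.31}
$$


where $W_Q, W_K \in \mathbb{R}^{C \times d}$ are shared across all indices. Matrix form representation
of the query and key are then given by


$$
Q = X W_Q, \quad K = X W_K, \tag{9.32}
$$

where $Q, K \in \mathbb{R}^{N \times d}$ are given by

$$
Q = \begin{bmatrix} q^1 \\ \vdots \\ q^N \end{bmatrix}, \quad K = \begin{bmatrix} k^1 \\ \vdots \\ k^N \end{bmatrix}. \tag{9.33}
$$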


We are often interested in the embedding of $x^n$ to a smaller-dimensional vector
$v^n \in \mathbb{R}^{d_v}$, which leads to the matrix representation of values:


$$
v^n = x^n W_V \in \mathbb{R}^{d_v}, \tag{9.34}
$$


where $W_V \in \mathbb{R}^{C \times d_v}$ is the linear embedding matrix for the values. Then, attention
is computed by


$$
y^m = \sum_{n=1}^{N} a_{mn} v^n, \tag{9.35}
$$

where

$$
a_{mn} := \frac{\exp\left(\langle x^m W_Q, x^n W_K \rangle\right)}{\sum_{n'=1}^{N} \exp\left(\langle x^m W_Q, x^{n'} W_K \rangle\right)}, \tag{9.36}
$$


or in matrix form, we have

$$
Y = A X W_V, \tag{9.37}
$$


where X, Y and A are defined by (9.29) and (9.30), respectively.
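A minimal NumPy sketch of the dot product attention in (9.31)-(9.37) is given below. The function names are illustrative, and the subtraction of the row maximum inside the softmax is a standard numerical safeguard that does not change the result of (9.36).

```python
import numpy as np

def softmax(Z, axis=-1):
    """Softmax along the given axis, stabilized by subtracting the maximum."""
    Z = Z - Z.max(axis=axis, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=axis, keepdims=True)

def dot_product_attention(X, W_Q, W_K, W_V):
    """Spatial attention for X of shape (N, C).

    W_Q, W_K: (C, d) embedding matrices for queries and keys.
    W_V:      (C, d_v) embedding matrix for values.
    Returns the attended features Y (N, d_v) and the attention map A (N, N).
    """
    Q = X @ W_Q                       # queries, (9.32)
    K = X @ W_K                       # keys, (9.32)
    A = softmax(Q @ K.T, axis=1)      # a_mn of (9.36); each row sums to one
    Y = A @ X @ W_V                   # (9.37)
    return Y, A
```

Note that the attention map A has N × N entries, so both memory and computation grow quadratically with the number of pixels.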


9.3.3 Channel Attention 


So far, we have discussed the mathematical formulation of spatial attention. One
downside of spatial attention is that it requires a matrix multiplication with the
N × N attention map A, which can be computationally intensive. To address this
problem, channel attention techniques have been developed. One of the most well-
known methods for channel attention is the so-called squeeze and excitation network
(SENet), which won the 2017 ImageNet challenge [68].

The SENet is composed of two steps: the squeeze and the excitation (see
Fig. 9.8). In the squeeze step, a 1 × C-dimensional vector z is generated by average
pooling as follows:


$$
z = \frac{1}{N}\,\mathbf{1}^\top X. \tag{9.38}
$$


At the excitation step, a 1 × C weight vector w is generated from z using a neural
network $F_\Theta$ which is parameterized by $\Theta$:


$$
w = F_\Theta(z). \tag{9.39}
$$

Fig. 9.8 Architecture of SENet

Then, the final attended map is given by

$$
Y = XW, \quad \text{where} \quad W := \mathrm{diag}(w), \tag{9.40}
$$

where diag(w) is a diagonal matrix whose diagonal component is obtained by the
vector w. One can easily see that the associated computational complexity is minimal.
Still, the SENet provides an efficient channel attention mechanism, which significantly
improves the performance of the neural network [68].
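The squeeze and excitation steps in (9.38)-(9.40) can be sketched as follows. The text leaves the network $F_\Theta$ unspecified; here it is assumed to be two fully connected layers with a ReLU in between and a sigmoid at the output, which follows the usual SENet design, and the weight matrices `W1` and `W2` are placeholders for the parameters Θ.

```python
import numpy as np

def squeeze_excitation(X, W1, W2):
    """Channel attention for X of shape (N, C).

    W1: (C, r) and W2: (r, C) are the weights of the assumed two-layer
    excitation network F_Theta (bias terms omitted for brevity).
    Returns the attended features Y (N, C) and the channel weights w (C,).
    """
    N = X.shape[0]
    z = np.ones(N) @ X / N                   # squeeze: average pooling, (9.38)
    h = np.maximum(z @ W1, 0.0)              # hidden layer with ReLU
    w = 1.0 / (1.0 + np.exp(-(h @ W2)))      # excitation with sigmoid gate, (9.39)
    Y = X @ np.diag(w)                       # channel-wise rescaling, (9.40)
    return Y, w
```

Since W is diagonal, the product X diag(w) is just a channel-wise rescaling of X, so the cost is linear in the number of feature entries rather than quadratic in N.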


9.4 Applications 


In this section, we provide a review of the exciting applications of normalization 
and attention in modern deep learning. 


9.4.1 StyleGAN 


One of the most exciting developments in CVPR 2019 was the introduction of a 
novel generative adversarial network (GAN) called StyleGAN from Nvidia [89]. As 
shown in Fig. 9.9, StyleGAN can generate high-resolution images that were realistic 
enough to shock the world. 

Fig. 9.9 Examples of fake faces generated by StyleGAN

Fig. 9.10 Architecture of StyleGAN

Although generative models, specifically GANs, will be discussed later in
Chap. 13, we are introducing StyleGAN here, as the main breakthrough of StyleGAN
comes from AdaIN. The right-hand neural network of Fig. 9.10 generates the
latent codes used as the style image feature vector, while the left-hand network
generates the content feature vectors from random noise. The AdaIN layer then 
combines the style features and the content features in order to generate more 
realistic features for each resolution. In fact, this architecture is fundamentally 
different from the standard GAN architecture that we will review later, with the 
fake image only being generated by a content generator (for example, the one on the 
left). Through the synergistic combination with another style generator, StyleGAN 
successfully produces very realistic images. 


9.4.2 Self-Attention GAN 


One important advantage of the attention mechanism is the separate control of 
query and key vectors. In the case of self-attention, both the query and the key are 
obtained from the same data set. In this case, the attention tries to extract the global 
information from the same input signal in order to find out which part of the signal 
needs to be focused on.

Fig. 9.11 Architecture of self-attention GAN. Both key and query are generated by the input
features

In a self-attention GAN (SAGAN) [71], self-attention layers are added into the
GAN so that both the generator and the discriminator can better model
relationships between spatial regions (see Fig. 9.11). It should be remembered that
in convolutional neural networks, the size of the receptive field is limited by the size
of the filter. With this in mind, self-attention is a great way to learn the relationship
between a pixel and all other positions, even regions that are far apart, so that global
dependencies can be easily grasped. Hence, a GAN endowed with self-attention is
expected to handle details better.

More specifically, let $X \in \mathbb{R}^{N \times C}$ be the feature map with N pixels and C
channels, and $x^m \in \mathbb{R}^C$ denote the m-th row vector of X, which represents the
feature vector at the m-th pixel location. The query, key, and the value images are
then generated as follows:


$$
q^m = x^m W_Q, \quad k^m = x^m W_K, \quad v^m = x^m W_V \tag{9.41}
$$

for all pixel indices $m = 1, \cdots, N$. Note that the matrices $W_Q, W_K, W_V \in \mathbb{R}^{C \times C}$ can
be implemented using 1 × 1 convolution (see Fig. 9.11). Then, similar to (9.37), the
attended image is represented by


$$
Y = AV = A X W_V, \tag{9.42}
$$

where

$$
V = \begin{bmatrix} v^1 \\ \vdots \\ v^N \end{bmatrix}, \tag{9.43}
$$


and the (m, n)-th element of A matrix is given by


$$
a_{mn} = \frac{\exp\left(\langle q^m, k^n \rangle\right)}{\sum_{n'=1}^{N} \exp\left(\langle q^m, k^{n'} \rangle\right)}. \tag{9.44}
$$

Then, the final self attended feature map is calculated by

$$
O = Y W_O, \tag{9.45}
$$


which can also be implemented using 1 × 1 convolution.

As shown in (9.42) and (9.45), the new feature vector $o^m$ is generated at the
m-th pixel location by the linear combination of the value vectors $\{v^n\}_{n=1}^{N}$ across
the whole image by weighting the elements of the attention map A. Therefore,
the receptive field of the self-attention map is the whole image, which makes the
image generation more effective. A disadvantage, however, is that we need a matrix
multiplication with the N × N attention map A, which can be computationally
expensive.


9.4.3 Attentional GAN: Text to Image Generation 


In Attentional GAN (AttnGAN) [72], the authors proposed an attention-driven 
architecture for text-to-image generation (see Fig. 9.12). In addition to the detailed 
structure for a fine-grained translation, the key idea of AttnGAN is to use the cross- 
domain attention. In particular, the query vector is generated from image areas, 
while the key vector is generated from word features. By combining the query and 
key, AttnGAN can automatically select the word level condition to generate different 
parts of the image [72]. 
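The cross-domain attention idea can be illustrated with a small NumPy sketch in which the queries come from image regions and the keys from word features. This is only a schematic example and not the actual AttnGAN implementation: the projection matrices `W_Q` and `W_K`, the use of the word features themselves as values, and the dot product score are all assumptions made here.

```python
import numpy as np

def cross_attention(H, E, W_Q, W_K):
    """Cross-domain attention between image regions and words.

    H: image region features, shape (N, C).
    E: word features, shape (T, D).
    W_Q: (C, d) query embedding; W_K: (D, d) key embedding.
    Returns a word-context vector for each region (N, D) and the
    attention map A (N, T).
    """
    Q = H @ W_Q                                   # queries from image regions
    K = E @ W_K                                   # keys from words
    scores = Q @ K.T                              # (N, T) similarity scores
    scores -= scores.max(axis=1, keepdims=True)   # numerical stabilization
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)             # softmax over the words
    return A @ E, A
```

Each row of A indicates how strongly one image region attends to each word, so different regions can be conditioned on different words of the sentence.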


9.4.4 Graph Attention Network 


In the graph attention network (GAT) [69], the main focus is on which neighboring
nodes a neural network should attend to more in order to achieve a better embedding
at the middle node (Fig. 9.13). To incorporate the graph connectivity, the authors
suggested specific constraints on the query, key, and value vectors as follows:


$$
q^v = x^v W, \quad k^u = v^u = x^u W, \quad u \in \mathcal{N}(v), \tag{9.46}
$$

where $\mathcal{N}(v)$ denotes the neighborhood of the node v.
From this, the attentional coefficients between the nodes are calculated by

$$
e_{vu} = \mathrm{score}(q^v, k^u),
$$


where score(·) denotes the specific attention mechanism. To make the coefficients
easily comparable across different nodes, the coefficients are normalized by


$$
\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{u' \in \mathcal{N}(v)} \exp(e_{vu'})}. \tag{9.47}
$$

Fig. 9.13 Graph attention network


Then, the graph neural network is represented by the normalized connective
coefficient:


$$
x^v = \sigma\left( \sum_{u \in \mathcal{N}(v)} \alpha_{vu}\, x^u W \right), \tag{9.48}
$$

where $\sigma(\cdot)$ denotes a nonlinear activation function.

9.4.5 Transformer 


Transformer is a deep machine learning model that was introduced in 2017 and
was originally used for natural language processing (NLP) [73]. In NLP,
recurrent neural networks (RNN) such as Long Short-Term Memory (LSTM) [92]
had traditionally been used. In an RNN, the data is processed in a sequential order
using the memory unit inside. Although Transformer is also designed to process
ordered data sequences such as speech, unlike the RNN, it processes the
entire sequence in parallel to reduce path lengths, making it easier to learn long-
distance dependencies in sequences. Since its inception, Transformer has become
the building block of most state-of-the-art architectures in NLP, resulting in the
development of the well-known Bidirectional Encoder Representations from
Transformers (BERT) [74], Generative Pre-trained Transformer 3 (GPT-3) [76], etc.

As shown in Fig. 9.14, Transformer-based language translation consists of an
encoder and decoder architecture. The main idea of Transformer is the attention
mechanism discussed earlier. In particular, the essence of the query, key, and value
vectors in the attention mechanism is fully utilized so that the encoder can learn the
language embedding and the decoder performs the language translation.

Fig. 9.15 Network architecture of encoder of Transformer


In particular, sentences from, for example, English are used on the encoder to 
learn how to embed each word in a sentence. In order to learn the long-range 
dependency between the words within the sentence, a self-attention mechanism 
is used on the encoder. Of course, self-attention is not enough to perform a 
complicated speech embedding task. Therefore, there are an additional residual 
connection, a layer normalization, and a neural feedforward network, followed 
by additional units of encoder blocks (see Fig. 9.15). Once trained, Transformer’s 
encoder generates the word embedding, which contains the structural role of each 
word within the sentence. 

In the decoder, these embedding vectors from the encoder are now used to
generate the key vectors, as shown in Figs. 9.14 and 9.16. This is combined with
the query vector that is generated from the target language, like French. This hybrid
combination then creates the attention map, which serves as the transformation
matrix of the words between the two languages by taking into account their
structural roles.

Fig. 9.16 Generation of key vectors for each decoder layer of Transformer


Another important component of Transformer is the positional encoding (see
the positional encoding blocks in Figs. 9.14 and 9.15). In contrast to RNN and
LSTM, each word in a sentence is processed simultaneously by Transformer in 
order to capture longer dependencies in a sentence, so the model itself does not 
have any notion of position for each word. However, the position of a word within 
a sentence is important as it determines the grammar and the semantics of the 
sentence. Therefore, there is a need to consider the order of the words, and the 
positional encoding is used for this. To be a valid positional encoding, a method 
should output a unique encoding of the position of each word in a sentence and 
easily generalize to longer sentences. 

Among the various possible approaches, the original authors of Transformer used
the sine and cosine functions of different frequencies [73]. More specifically, let n
be the desired position in an input sentence and $p_n \in \mathbb{R}^d$ be its corresponding
encoding, where d is the encoding dimension for which an even number is chosen.
Then, the position encoding vector is given by


$$
p_n = \begin{bmatrix} \sin(\omega_1 n) \\ \cos(\omega_1 n) \\ \sin(\omega_2 n) \\ \cos(\omega_2 n) \\ \vdots \\ \sin(\omega_{d/2}\, n) \\ \cos(\omega_{d/2}\, n) \end{bmatrix} \in \mathbb{R}^d, \quad \text{where} \quad \omega_k = \frac{1}{10000^{2k/d}}. \tag{9.49}
$$


This position encoding vector is then added to the word embedding vector $x_n \in \mathbb{R}^d$
to obtain a position encoded word embedding vector:


$$
x_n \leftarrow x_n + p_n, \tag{9.50}
$$


which is then fed into the self-attention module in Transformer.
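The positional encoding in (9.49) can be generated for all positions at once with a few lines of NumPy. The sketch below follows the indexing of (9.49), i.e. k = 1, ..., d/2, and assumes that positions are counted from n = 0; the function name is illustrative.

```python
import numpy as np

def positional_encoding(num_positions, d):
    """Sinusoidal positional encodings, one row p_n per position n."""
    if d % 2 != 0:
        raise ValueError("the encoding dimension d must be even")
    n = np.arange(num_positions)[:, None]        # positions, shape (P, 1)
    k = np.arange(1, d // 2 + 1)[None, :]        # k = 1, ..., d/2
    omega = 1.0 / 10000.0 ** (2.0 * k / d)       # frequencies of (9.49)
    P = np.empty((num_positions, d))
    P[:, 0::2] = np.sin(omega * n)               # sin(omega_k n) in even columns
    P[:, 1::2] = np.cos(omega * n)               # cos(omega_k n) in odd columns
    return P
```

For word embeddings stored as an array X of shape (number of words, d), the update (9.50) then reads `X = X + positional_encoding(X.shape[0], d)`.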

Readers may wonder why the positional encoding vector is summed with a 
word embedding instead of concatenation. Although this was used empirically 
in the original paper [73], recent theoretical analysis showed that Transformer 
architecture with additive positional encodings is Turing complete [93] and can be 
reparametrized to express any convolutional layer [94]. 

Transformer is an ingenious combination of the full mathematical principle of 
attention, which uses separate query and key vectors for the specific purpose of 
language translation. Because of this, Transformer has become the main workhorse 
for modern NLP. 


9.4.6 BERT 


One of the latest milestones in NLP is the release of BERT (Bidirectional Encoder 
Representations from Transformers) [74]. This release of BERT can even be seen as 
the beginning of a new era in NLP. One of the unique features of BERT is that the 
resulting structure is as regular as FPGA (Field Programmable Gate Array) chips, so 
the BERT unit can be used for different purposes and languages by simply changing 
the training scheme. 

The main architecture of BERT is the cascaded connection of bidirectional 
transformer encoder units, as shown in Fig. 9.17. Due to the use of the encoder-part 
of the Transformer architecture, the number of input and output features remains 
the same, while each feature vector dimension may be different. For example, 
the input feature can be a one-hot coded word, the feature dimension of which 
is determined by the size of the corpus vocabulary. The output may be the low 
dimensional embedding that sums up the role of the word in context. The reason for 
using the bidirectional Transformer encoder is based on the observation that people
can understand the sentence even if the order of the words within the sentence is
reversed. By considering the reverse order, the role of each word in context is better
summarized as an attention map, resulting in more efficient embedding of words.

Fig. 9.17 BERT architecture

Fig. 9.18 Pre-training and fine-tuning scheme for BERT training

Yet another beauty of BERT lies in the training. More specifically, as shown 
in Fig. 9.18, BERT training consists of two steps: pre-training and fine-tuning. In 
the pre-training step, the goal of the task is to guess the masked word within an 
input sentence. Figure 9.19 shows a more detailed explanation of this masked word 
estimation. Approximately 15% of the words in the input sentence from Wikipedia 
are masked with a specific token (in this case, [MASK]), and the goal of the training 
is to estimate the masked word from the embedded output in the same place. Since 
the BERT output is just an embedded feature, we need an additional fully connected 
neural network (FFNN) and softmax layer to estimate the specific word. With this 
additional network we can correctly pre-train the BERT unit. 
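The data preparation for the masked word estimation can be illustrated with a few lines of plain Python. This is a simplified sketch of the description above, in which a fixed fraction of the words is replaced by the [MASK] token and the original words are kept as prediction targets; the function name, the default arguments, and the fixed random seed are assumptions made for the example.

```python
import random

def mask_tokens(tokens, ratio=0.15, mask_token="[MASK]", seed=0):
    """Replace about `ratio` of the tokens by `mask_token`.

    Returns the masked token list and a dictionary that maps each masked
    position to the original word to be estimated at that position.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    num_masked = max(1, round(ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), num_masked))
    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = masked[i]     # word the network has to recover
        masked[i] = mask_token     # what the network actually sees
    return masked, targets
```

During pre-training, the masked list would be fed to BERT, and the softmax output at each masked position would be compared with the corresponding entry of `targets`.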

Once BERT pre-training is finished, the BERT unit is fine-tuned using supervised
learning tasks. For example, Fig. 9.20 shows a supervised learning task. Here, the
input for BERT consists of two sentences, separated by another token [SEP]. The
goal of supervised learning is then to assess whether the second sentence is a correct
continuation of the first sentence. The result is embedded in the BERT
Output 1, which is then used as the input of a fully connected neural network, followed
by a softmax layer to estimate whether the second sentence is next. Since the numbers
of inputs and outputs are the same in BERT, the first word of the input should be
a token [CLS] that indicates a vacant word.

Another example of a supervised fine-tuning is the classification of whether the
sentence is spam or not, as shown in Fig. 9.21. In this case, only a single sentence is
used as the BERT input and Output 1 of BERT is used to classify whether the input
sentence is spam or not.

In fact, there are multiple ways of utilizing the BERT unit for supervised
fine-tuning, which is another important advantage of BERT [74].


9.4.7 Generative Pre-trained Transformer (GPT) 


Generative pre-trained transformers (GPTs) are language models developed by 
OpenAI that produce human-like text. In particular, the third-generation model, 
GPT-3, is arguably the most powerful and controversial artificial intelligence model 
for NLP due to its incredible ability to produce text that is indistinguishable from
that written by humans [76].

Recall that BERT requires pre-training on a large corpus of text, followed by
fine-tuning on a specific task. However, the requirement of a task-specific
fine-tuning data set consisting of thousands or tens of thousands of examples is often
quite demanding. This is very different from humans, who are usually able to
complete a new language task using a few examples.

GPT-2 [75] and GPT-3 [76] were developed based on the observation that 
scaling the language model greatly improves task-agnostic, few-shot performance, 
and sometimes even competes with prior art fine-tuning approaches. The goal of 
GPT training is similar to BERT pre-training, where the next word in a sentence 
is estimated based on the previous words in a sentence. For this reason, GPT 
stands for generative pre-trained Transformer. For example, the GPT is trained to 
generate the word “awesome” by using the preceding words “The latest language 
model GPT-3 is” as input. While this pure pre-training scheme doesn’t improve 
BERT’s performance, one of the main reasons for the success of GPT-2, and GPT- 
3 in particular, is its massive architecture that makes generative pre-training even 
more powerful than fine-tuning. Compared to the largest BERT architecture with 
around 340 million parameters, GPT-3 is extremely massive with around 175 billion 
parameters. 

Recall that the generative estimation of the following word can be done by the
Transformer decoder in language translation. Accordingly, GPT-3 consists of
a stack of 96 Transformer decoder layers, which differs from the encoder-only
architecture in BERT (see Fig. 9.22). Each decoder layer is composed of multiple
decoder blocks, which consist of masked self-attention blocks with a width of 2048
tokens and a feedforward neural network (see Fig. 9.23). As shown in Fig. 9.24, the
masked self-attention calculates the attention matrix using only the preceding words
in a sentence, so that it can be used to estimate the next word.

Fig. 9.22 Differences in BERT and GPT architecture (panels: GPT-2 and BERT)

Fig. 9.23 Architecture of GPT decoder block

Fig. 9.24 Difference between the self-attention in BERT and masked self-attention in GPT-3
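The masking step can be made concrete with a small numerical sketch. The code below is a minimal single-head illustration in NumPy, not the GPT-3 implementation: the toy sizes, the random embedding matrices `W_q`, `W_k`, `W_v`, and the helper names are all assumptions chosen for illustration. It shows that masking the scores of future tokens before the softmax yields a lower-triangular attention matrix.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v, causal=False):
    """Single-head scaled dot-product self-attention; rows of X are tokens."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])
    if causal:
        # Token i may attend only to tokens 0..i: mask the strict upper triangle.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    A = softmax(scores, axis=-1)
    return A, A @ V

rng = np.random.default_rng(0)
N, d = 5, 8  # toy sizes (hypothetical); GPT-3 uses a context of 2048 tokens
X = rng.standard_normal((N, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

A_full, _ = self_attention(X, W_q, W_k, W_v)                 # BERT-style
A_mask, Y_mask = self_attention(X, W_q, W_k, W_v, causal=True)  # GPT-style
```

With `causal=True`, row i of `A_mask` places weight only on positions 0 to i, so the attended feature for a token never depends on later tokens.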

To train the 175 billion weights, GPT-3 is trained with 499 billion tokens or 
words. Sixty percent of the training data set comes from a filtered version of 
Common Crawl consisting of 410 billion tokens. Other sources are 19 billion tokens 
from WebText2, 12 billion tokens from Books1, 55 billion tokens from Books2, and 
3 billion tokens from Wikipedia [76]. Nonetheless, the performance of GPT-3 can
be affected by the quality of the training data. For example, it was reported that 
GPT-3 generates sexist, racist and other biased and negative language when it was 
asked to discuss Jews, women, black people, and the Holocaust [95]. 


9.4.8 Vision Transformer 


Inspired by the fact that the Transformer architecture has become the state of the art
for NLP, researchers have explored its applications for computer vision. As
mentioned earlier, in computer vision, attention is usually applied in connection 
with convolutional networks, so that certain components of convolutional networks 
are replaced with attention while maintaining their overall structure. In [96], the 
authors have shown that this dependence on CNNs is not necessary and a pure 
transformer applied directly to sequences of image patches can work very well in 
image classification tasks. 

Their model, called Vision Transformer (ViT), is depicted in Fig. 9.25. To handle
2D images, the input image x is reshaped into a sequence of flattened 2D patches,
after which each patch is embedded into a D-dimensional vector using a trainable
linear projection. The Transformer then uses a constant latent vector size D through
all of its layers. Position embeddings are added to the patch embeddings to retain
positional information. The resulting sequence of embedding vectors serves as input
to the encoder. As for the [Class] token at the front, it is a learnable embedding
prepended to the sequence of embedded patches, and its state at the output of the
Transformer encoder serves as the representation of the entire image. A classification
head is attached during both pre-training and fine-tuning, so that the network is
trained to produce the image representation that gives the best classification results.
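The input pipeline described above can be sketched in a few lines of NumPy. This is only an illustration under assumed toy sizes: the image size, patch size `P`, latent size `D`, and the randomly initialized projection `E`, class embedding `cls`, and position embeddings `pos` are hypothetical stand-ins for trained parameters.

```python
import numpy as np

def patchify(img, P):
    """Split an (H, W, C) image into (H/P)*(W/P) flattened P x P patches."""
    H, W, C = img.shape
    assert H % P == 0 and W % P == 0
    x = img.reshape(H // P, P, W // P, P, C)
    x = x.transpose(0, 2, 1, 3, 4)        # (H/P, W/P, P, P, C)
    return x.reshape(-1, P * P * C)       # one row per patch

rng = np.random.default_rng(0)
H, W, C = 32, 32, 3     # toy image size (assumption)
P, D = 8, 16            # patch size and latent dimension (assumptions)
img = rng.standard_normal((H, W, C))

num_patches = (H // P) * (W // P)
E = 0.02 * rng.standard_normal((P * P * C, D))      # trainable linear projection
cls = np.zeros((1, D))                              # learnable [Class] embedding
pos = 0.02 * rng.standard_normal((num_patches + 1, D))  # position embeddings

patches = patchify(img, P)                          # (num_patches, P*P*C)
tokens = np.concatenate([cls, patches @ E], axis=0) + pos  # encoder input
```

The first row of `tokens` corresponds to the [Class] token; after the encoder, that row would be passed to the classification head.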

The Transformer encoder in ViT consists of alternating layers of multi-headed 
self-attention and MLP blocks. Layer norm and residual connections are applied 
before and after every block, respectively. The MLP contains two layers with a 
GELU non-linearity. Typically, ViT is trained on large data sets, and fine-tuned to 
(smaller) downstream tasks. For this, we remove the pre-trained prediction head 
and attach a zero-initialized D × K feedforward layer, where K is the number of
downstream classes. 


9.5 Mathematical Analysis of Normalization and Attention 


So far we have discussed normalization and attention. Normalization was originally 
developed for accelerating stochastic gradient methods, and has been extended 
to style transfer, image generation, etc. On the other hand, due to its ability 
to learn long-range relationships and its flexibility from manipulating query and 
key, attention has been successfully extended to various applications, leading to 
breakthroughs in natural language processing approaches such as BERT, GPT-3,
etc. 

As you may have noticed while reading, normalization and attention may have 
a very similar mathematical formulation. For example, for a given feature map
$X \in \mathbb{R}^{HW \times C}$, the instance normalization, AdaIN, and WCT can be represented
as follows:


Y=XT+B, (9.51) 


where the channel-directional transform T and the bias B are learned from the 
statistics of the feature maps. The only differences between instance normalization, 
AdaIN, and WCT are their specific ways of estimating T and B. For example, 
all elements of T are estimated from the input features in the case of instance 
normalization, while they are estimated from the statistics on content and style 
images in the case of AdaIN and WCT. The main difference between WCT, instance
normalization and AdaIN is that T is a densely populated matrix for the case of
WCT, while instance norm and AdaIN use a diagonal matrix.

On the other hand, the spatial attention can be represented by


Y=AX, (9.52) 


where A is calculated from its own feature for the case of self-attention, or with the 
help of other domain features for the case of cross-domain attention. Similarly, the 
channel attention such as SENet can be computed as 


Y = XT, (9.53) 


where the diagonal matrix T is again calculated from X. 

This implies that normalization and attention, with the exception of the specific
differences in the generation of A, T, and B, can be viewed as special cases of
the following transformation:


Y=AXT+B. (9.54) 


Mathematically, A modifies the column space of X, whereas T controls the row
space of X. Therefore, the attention map A and the transform T control different
factors of variation in the feature X.
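To see how the pieces of (9.51)-(9.54) fit together numerically, the following NumPy sketch builds T and B for instance normalization and AdaIN, and a row-stochastic A for a toy self-attention. It is an illustration only: the feature sizes, the random "style" feature `Xs`, and the use of identity query/key embeddings in the attention score are assumptions, not the constructions used in any particular network.

```python
import numpy as np

rng = np.random.default_rng(0)
HW, C = 6, 3                                   # pixels and channels (toy sizes)
X = rng.standard_normal((HW, C))               # content feature map
Xs = 2.0 * rng.standard_normal((HW, C)) + 1.0  # hypothetical style feature map

mu, sigma = X.mean(axis=0), X.std(axis=0)
mu_s, sigma_s = Xs.mean(axis=0), Xs.std(axis=0)
ones = np.ones((HW, 1))

# Instance normalization as Y = X T + B with a diagonal T, cf. (9.51).
T_in = np.diag(1.0 / sigma)
B_in = -ones @ (mu / sigma)[None, :]
Y_in = X @ T_in + B_in

# AdaIN: same form, but T and B also use the style statistics.
T_ada = np.diag(sigma_s / sigma)
B_ada = ones @ (mu_s - mu * sigma_s / sigma)[None, :]
Y_ada = X @ T_ada + B_ada

# Spatial self-attention Y = A X, cf. (9.52), with dot-product scores.
S = X @ X.T
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)

# General form (9.54): A mixes pixels (left), T mixes channels (right).
Y = A @ X @ T_ada + B_ada
```

Left multiplication by `A` acts on each channel (column) separately, while right multiplication by `T` acts on each pixel (row) separately, which is why the two can be modulated independently.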

Based on this observation, Kwon et al. [97] proposed the so-called Diagonal 
GAN. This is based on the following intuition: although A was a dense matrix 
obtained from X in the original self-attention, the insight from AdaIN can be used to 
obtain an efficient diagonal attention map A from a novel attention code generator 
for content control. Specifically, they introduced a novel diagonal attention (DAT) 
module to manipulate the content feature maps as shown in Fig. 9.26b. One of
the important advantages of the method is that thanks to the symmetry in (9.54),
both AdaIN and DAT can be applied to each layer, so that the image content and
style can be modulated independently. This leads to an effective disentanglement of
the content and style components in generated images. Furthermore, the proposed 
method has flexibility by selectively controlling the spatial attribute of generated 
images at arbitrary resolution by changing the hierarchical attention maps. 

As shown in Fig. 9.27, the combination of AdaIN and DAT is quite impressive.
For given source images in Fig. 9.27a, which are generated from arbitrary style and
content code, (b) shows the samples with varying style codes and fixed content code.
Note that the hairstyles and identities vary while the face directions and expressions
are similar. On the other hand, if we generate samples with varying content codes
and fixed style as shown in (c), the face direction and expression for the same person
or animal change. Finally, if the content and style codes are both varied as shown in (d),
the face direction, expression, hairstyle, and identity all change accordingly.
This clearly shows the disentanglement between style and content.

One may wonder whether additive noise at each layer of styleGAN in Fig. 9.26b 
may serve a similar role in the content variation. In fact, the addition of the noise 
for the original styleGAN is from a similar motivation, as indicated by the authors’ 
claim that the right-hand network generates the content feature vectors from random 
noise. That said, it should be remembered that the additive noise terms are basically 
additions to the bias term in (9.54), which is fundamentally different from A that 
modulates the column space of X. In fact, the additional bias terms affect both the
row and column spaces of X, resulting in the entangled modulation between the
style and content.


9.6 Exercises 


1. Find the conditions when the WCT transform in (9.22) is reduced to AdaIN. 
2. Let the feature map with the number of pixels H x W = 4 and the channels 
C = 3 be given by 


L233 
—-1-3 0 

X= : : 
Sg (9.55) 
0 0 —-5 


a. Perform the layer normalization of X. 
b. Perform the instance normalization of X. 
Fig. 9.27 (Top) 1024×1024 images generated by our method, trained using CelebA-HQ data set.
(Bottom) 512×512 images generated by our method, trained using AFHQ data set. (a) A source
image generated from arbitrary style and content code. (b) Samples with varying style codes and
fixed content code. (c) Samples generated with varying content codes and fixed style. (d) Samples
generated with both varying content and style codes

3. Additionally, suppose that the feature map for the style image is given by


0 11 
-1-11 
S= 1 ool: (9.56) 


-1 11 


a. For the given feature map in (9.55), perform the adaptive instance normaliza- 
tion from X to the style of S. 

b. For the given feature map in (9.55), perform the WCT style transfer from X 
to the style of S. 


4. Using the feature map in (9.55), we are interested in computing the self-attention
map. Let $W_Q$ and $W_K$ be the embedding matrices for the query and key,
respectively:


1 z 0 
Wo=|05], We=] 1-1}. (9.57) 
00 10 5 


a. Using the dot product score function, compute the attention matrix A. 

b. What is the attended feature map, i.e. Y = AX? 

c. For the case of masked self-attention in GPT-3, compute the attention mask A 
and attended feature map Y = AX. 


5. For a given positional encoding in (9.49) for the Transformer with encoding
dimension d = 10, compute the positional encoding vector $p_n$ for n = 1, ..., 10.

6. Explain the following sentence in detail: “BERT has encoder only structure, 
while GPT-3 has decoder only architecture.” 

7. For a given feature map $X \in \mathbb{R}^{HW \times C}$, show that the feature map of styleGAN after
the application of AdaIN and noise is represented by

Y = XT + B. (9.58)

Specify the structure of the matrices T and B.

8. For a given feature map $X \in \mathbb{R}^{HW \times C}$, show that the feature map of the Diagonal
GAN after the application of AdaIN, DAT, and noise is represented by

Y = AXT + B. (9.59)

Specify the structure of the matrices A, T and B, and their mathematical roles.


Part III 
Advanced Topics in Deep Learning 


“I am really confused. I keep changing my opinion on a daily basis, and I cannot
seem to settle on one solid view of this puzzle. No, I am not talking about world
politics or the current U.S. president, but rather something far more critical to
humankind, and more specifically to our existence and work as engineers and
researchers. I am talking about ... deep learning.”


— Michael Elad 


Chapter 10
Geometry of Deep Neural Networks


10.1 Introduction 


In this chapter, which is mathematically intensive, we will try to answer perhaps the 
most important questions of machine learning: what does the deep neural network 
learn? How does a deep neural network, especially a CNN, accomplish these goals? 
The full answer to these basic questions is still a long way off. Here are some of 
the insights we’ve obtained while traveling towards that destination. In particular, 
we explain why the classic approaches to machine learning such as single-layer 
perceptron or kernel machines are not enough to achieve the goal and why a modern 
CNN turns out to be a promising tool. 

Recall that at the early phase of the deep learning revolution, most of the 
CNN architectures such as AlexNet, VGGNet, ResNet, etc., were mainly developed 
for the classification tasks such as ImageNet challenges. Then, CNNs started to 
be widely used for low-level computer vision problems such as image denoising 
[90, 98], super-resolution [99, 100], segmentation [38], etc., which are considered as 
regression tasks. In fact, classification and regression are the two most fundamental 
tasks in machine learning, which can be unified under the umbrella of function 
approximation. Recall that the representer theorem [15] says that a classifier design
or regression problem for a given training data set $\{(x_i, y_i)\}_{i=1}^n$ can be addressed by
solving the following optimization problem:

$$\min_{f \in \mathcal{H}_k} \frac{1}{2}\|f\|_{\mathcal{H}}^2 + \sum_{i=1}^n \ell\left(y_i, f(x_i)\right), \tag{10.1}$$

where $\mathcal{H}_k$ denotes the reproducing kernel Hilbert space (RKHS) with the kernel
$k(x, x')$, $\|\cdot\|_{\mathcal{H}}$ is the Hilbert space norm, and $\ell(\cdot, \cdot)$ is the loss function. One of the
most important results of the representer theorem is that the minimizer $f$ has the
following closed-form representation:

$$f(x) = \sum_{i=1}^n \alpha_i k(x_i, x), \tag{10.2}$$

where $\{\alpha_i\}_{i=1}^n$ are learned parameters from the training data set. For example, if a
hinge function is used as a loss, the solution becomes a kernel SVM, whereas if an
$l_2$ function is used as a loss, it becomes a kernel regression.

In general, the solution $f(x)$ in (10.2) is a nonlinear function of the input $x$ based
on the kernel $k(x_i, \cdot)$, which is nonlinearly dependent upon $x$. This nonlinearity of
the kernel makes the expression in (10.2) more expressive, thereby generating a
wide variation of functions within the RKHS $\mathcal{H}_k$.
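As a concrete instance of the representation (10.2), the sketch below fits a kernel regression in NumPy. It assumes an RBF kernel with a hypothetical width `gamma`, and a squared loss with ridge weight `lam`, i.e. a scaled variant of (10.1); under these assumptions the coefficients solve the linear system $(K + \lambda I)\alpha = y$. The toy data are made up for illustration.

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=10.0):
    """k(x, x') = exp(-gamma * (x - x')^2) for 1-D inputs (assumed kernel)."""
    return np.exp(-gamma * (x1[:, None] - x2[None, :]) ** 2)

rng = np.random.default_rng(0)
n = 30
x_train = np.linspace(0.0, 1.0, n)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(n)

lam = 1e-2                                  # ridge weight (assumption)
K = rbf_kernel(x_train, x_train)            # Gram matrix K[i, j] = k(x_i, x_j)
alpha = np.linalg.solve(K + lam * np.eye(n), y_train)

def f(x):
    """Evaluate f(x) = sum_i alpha_i k(x_i, x) as in (10.2)."""
    return rbf_kernel(np.atleast_1d(x), x_train) @ alpha

y_pred = f(np.linspace(0.0, 1.0, 200))
```

Note that `alpha` is computed once from the training data and the kernel is chosen by hand, which are exactly the two limitations discussed next.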

That said, the expression in (10.2) still has fundamental limitations. First, the
RKHS $\mathcal{H}_k$ is specified by choosing the kernel in a top-down manner, and to the best
of our knowledge, there is no way to learn it automatically from the data. Second,
once the kernel machine is trained, the parameters $\{\alpha_i\}_{i=1}^n$ are fixed, and it is not
possible to adjust them at the test phase. These drawbacks fundamentally limit the
expressivity of the kernel machine, that is, its capability of approximating arbitrary
functions. Of course, one could increase the expressivity by
increasing the complexity of the learning machines, for example, by combining multiple
kernel machines. However, our goal is to achieve better expressivity for a given
complexity constraint, and in this sense the kernel machine has problems.


10.1.1 Desiderata of Machine Learning 


Given the limitations of the kernel machine, we can state the following desiderata,
that is, the properties that an ultimate learning machine should satisfy:

• Data-driven model: The function space that a learning machine can represent
should be learned from the data, rather than specified by a top-down mathematical
model.
• Adaptive model: Even after the machine has learned, the learned model should
adapt to the given input data at the test phase.
• Expressive model: The expressivity of the model should increase faster than the
model complexity.
• Inductive model: The learned information from the training data should be used
at the test phase.

In the following, we review two classical approaches, the single-layer perceptron and
the frame representation, and explain why these classical models failed to meet the
desiderata. Later we will show how the modern deep learning approaches have been
developed by overcoming the drawbacks of these classical approaches by exploiting
their inherent strengths.


10.2 Case Studies 


10.2.1 Single-Layer Perceptron


The single-layer perceptron is a special case of the multilayer perceptron (MLP),
which consists of fully connected neurons at a single hidden layer. Specifically,
let $\varphi: \mathbb{R} \to \mathbb{R}$ be a nonconstant, bounded, and continuous activation function. Let
$\mathcal{X} \subset \mathbb{R}^n$ denote the input space. Then, a single-layer perceptron $f_\Theta: \mathcal{X} \to \mathbb{R}$ can
be represented by

$$f_\Theta(x) = \sum_{i=1}^d v_i \varphi\left(w_i^\top x + b_i\right), \quad x \in \mathcal{X}, \tag{10.3}$$

where $w_i \in \mathbb{R}^n$ is a weight vector, $v_i, b_i \in \mathbb{R}$ are real constants, and
$\Theta = \{(w_i, v_i, b_i)\}_{i=1}^d$ represents the neural network parameters. Then, the parameters
are estimated by solving the following optimization problem using the training data
$\{(x_i, y_i)\}_{i=1}^n$:

$$\min_{\Theta} \sum_{i=1}^n \ell\left(y_i, f_\Theta(x_i)\right) + \lambda R(\Theta), \tag{10.4}$$

where $\lambda$ is a regularization parameter and $R(\Theta)$ is a regularization function with
respect to the parameter set $\Theta$.

One of the classical results for the representation power of single-layer percep- 
trons dates from 1989 [48]. It states that a feed-forward network with a single hidden 
layer containing a finite number of neurons can approximate continuous functions 
on compact subsets under mild assumptions on the activation function. 


Theorem 10.1 (Universal Approximation Theorem [48]) Let the space of real-valued
continuous functions on a compact set $\mathcal{X}$ be denoted by $C(\mathcal{X})$. Then, given
any $\epsilon > 0$ and any function $g \in C(\mathcal{X})$, there exists an integer $d$ such that the
single-layer perceptron in (10.3) is an approximate realization of the function $g$; that is,

$$|f_\Theta(x) - g(x)| < \epsilon$$

for all $x \in \mathcal{X}$.
The theorem thus states that simple neural networks can represent a wide variety
of interesting functions when given appropriate parameters. In fact, the universal
approximation theorem was a blessing for classic machine learning, since it promoted
research interest in the neural network as a powerful function approximator, but it
also turned out to be a curse for the development of machine learning by hindering
the understanding of the role of deep neural networks.

More specifically, the theorem only guarantees the existence of d, the number 
of neurons, but it does not specify how many neurons are required for a given 
approximation error. Only recently have people realized that the depth matters, i.e. 
there exists a function that a deep neural network can approximate but a shallow 
neural network with the same number of parameters cannot [101-105]. In fact, these 
modern theoretical studies have provided a theoretical foundation for the revival of 
modern deep learning research. 
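The dependence of the approximation error on the number of neurons d can be probed numerically. The sketch below is not a proof of Theorem 10.1 and does not train the network by (10.4): as a simplifying assumption, the hidden weights `w` and biases `b` of (10.3) are drawn at random and only the output weights v are fitted by least squares. The target function, the tanh activation, and the sampling ranges are likewise illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)            # samples of a compact input set
g = np.sin(3 * x) + 0.5 * x ** 2           # hypothetical target function g

d_max = 200
w = rng.normal(scale=4.0, size=d_max)      # random hidden weights (assumption)
b = rng.uniform(-4.0, 4.0, size=d_max)     # random hidden biases (assumption)

def rms_fit_error(d):
    """Fit the output weights v of (10.3) for the first d neurons."""
    H = np.tanh(np.outer(x, w[:d]) + b[:d])          # hidden activations (200, d)
    v, *_ = np.linalg.lstsq(H, g, rcond=None)        # least-squares output layer
    return np.sqrt(np.mean((H @ v - g) ** 2))

errors = {d: rms_fit_error(d) for d in (2, 10, 50, 200)}
```

Because the neuron sets are nested, adding neurons cannot increase the best achievable fit error; how fast the error actually decays with d is precisely what the theorem leaves unspecified.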

When compared with the kernel machine (10.2), the pros and cons of the single-layer
perceptron in (10.3) can be easily understood. Specifically, $\varphi(w_i^\top x + b_i)$ in
(10.3) works similarly to a kernel function $k(x_i, x)$, and $v_i$ in (10.3) is similar to the
weight parameter $\alpha_i$ in (10.2). However, the nonlinear mapping in the perceptron,
i.e. $\varphi(w_i^\top x + b_i)$, does not necessarily satisfy the positive semidefiniteness of the
kernel, thereby increasing the approximable functions beyond the RKHS to a larger
function class in Hilbert space. Therefore, there exists potential for improving the
expressivity. On the other hand, the weighting parameters $v_i$ are still fixed once the
neural network is trained, which leads to limitations similar to those of the kernel
machines.


10.2.2 Frame Representation


Now, we review another class of function representation called a frame [1]. To 
understand the mathematical concept of a frame, we start with its simplified form— 
the basis. 

In mathematics, a set $\mathcal{B} = \{b_i\}_{i=1}^m$ of elements (vectors) in a vector space
$V$ is called a basis if every element of $V$ may be written in a unique way as a
linear combination of elements of $\mathcal{B}$, that is, for every $f \in V$, there exist unique
coefficients $\{a_i\}$ such that

$$f = \sum_{i=1}^m a_i b_i. \tag{10.5}$$


Unlike the basis, which leads to a unique expansion, the frame is composed of
redundant basis vectors, which allows multiple representations. Frames can also be
extended to deal with function spaces, in which case the number of frame elements
is infinite. Formally, a set of functions

$$\Phi = [\phi_k]_{k \in \Gamma} = [\cdots \ \phi_{k-1} \ \phi_k \ \cdots]$$

in a Hilbert space $H$ is called a frame if it satisfies the following inequality [1]:

$$\alpha \|f\|^2 \leq \sum_{k \in \Gamma} |\langle f, \phi_k \rangle|^2 \leq \beta \|f\|^2, \quad \forall f \in H, \tag{10.6}$$

where $\alpha, \beta > 0$ are called the frame bounds. If $\alpha = \beta$, then the frame is said to be
tight. In fact, an orthonormal basis is a special case of tight frames.

By writing $c_k := \langle f, \phi_k \rangle$ as the expansion coefficient with respect to the $k$-th
frame vector $\phi_k$ and defining the frame coefficient vector

$$c = [c_k]_{k \in \Gamma} = \Phi^\top f,$$

(10.6) can be equivalently represented by

$$\alpha \|f\|^2 \leq \|c\|^2 \leq \beta \|f\|^2, \quad \forall f \in H. \tag{10.7}$$

This implies that the energy of the expansion coefficients should be bounded by the
original signal energy, and for the case of the tight frame, the expansion coefficient
energy is the same as the original signal energy up to the scaling factor.

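When the frame lower bound $\alpha$ is nonzero, the original signal can be recovered
from the frame coefficient vector $c = \Phi^\top f$ using the dual frame
operator $\tilde\Phi$ given by

$$\tilde\Phi = (\Phi\Phi^\top)^{-1}\Phi, \tag{10.8}$$

which satisfies the so-called frame condition:

$$\tilde\Phi\Phi^\top = I, \tag{10.9}$$

because we have

$$\hat f := \tilde\Phi c = \tilde\Phi\Phi^\top f = f,$$

or equivalently,

$$f = \sum_{k \in \Gamma} c_k \tilde\phi_k = \sum_{k \in \Gamma} \langle f, \phi_k \rangle \tilde\phi_k. \tag{10.10}$$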
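The frame bounds, the dual frame, and the perfect reconstruction property can be checked on a small finite-dimensional example. The sketch below assumes a random redundant frame in $\mathbb{R}^4$ with seven frame vectors stored as the columns of `Phi`; the sizes and the random construction are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 7                               # signal dimension, number of frame vectors
Phi = rng.standard_normal((n, m))         # columns are the frame vectors phi_k

# Frame bounds in (10.6): extreme eigenvalues of Phi Phi^T.
eigs = np.linalg.eigvalsh(Phi @ Phi.T)
alpha, beta = eigs[0], eigs[-1]

# Dual frame operator (10.8): (Phi Phi^T)^{-1} Phi.
Phi_dual = np.linalg.solve(Phi @ Phi.T, Phi)

f = rng.standard_normal(n)
c = Phi.T @ f                             # analysis: frame coefficients
f_rec = Phi_dual @ c                      # synthesis with the dual frame, (10.10)
```

Since m > n, the coefficient vector `c` is redundant, yet `f_rec` coincides with `f` because the lower frame bound `alpha` is strictly positive.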


Note that (10.10) is a linear signal expansion, so it is not useful for machine
learning tasks. However, something more interesting occurs when it is combined
with a nonlinear regularization. For example, consider a regression problem to
estimate a noiseless signal from the noisy measurement $y$:

$$y = f + w, \tag{10.11}$$

where $w$ is the additive noise and $f$ is the unknown signal to estimate. If we
formulate a loss function as follows:

$$\min_f \frac{1}{2}\|y - f\|^2 + \lambda \|\Phi^\top f\|_1, \tag{10.12}$$

where $\|\cdot\|_1$ is the $l_1$ norm, then the solution satisfies the following [106]:

$$\hat f = \sum_{k \in \Gamma} \rho_\lambda\left(\langle y, \phi_k \rangle\right) \tilde\phi_k, \tag{10.13}$$

where $\rho_\lambda(\cdot)$ is a nonlinear thresholding function that depends on the regularization
parameter $\lambda$. This implies that the signal representation changes depending on the
input $y$, since only a small set of coefficients $\langle y, \phi_k \rangle$ will be nonzero after processing
with the nonlinear thresholding, and the signal is represented by only a small set of
dual bases $\tilde\phi_k$ corresponding to the locations of the nonzero expansion coefficients.
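For the special case of an orthonormal basis, where the dual frame equals the frame itself, the minimizer of (10.12) is obtained exactly by soft-thresholding the coefficients, which gives a simple instance of (10.13). The sketch below assumes this orthonormal case with a random orthonormal `Phi`; the sparse test signal, the noise level, and the value of `lam` are made up for illustration.

```python
import numpy as np

def soft_threshold(c, lam):
    """rho_lambda(c): shrink each coefficient toward zero by lam."""
    return np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)

def objective(f, y, Phi, lam):
    """Cost in (10.12): 0.5 * ||y - f||^2 + lam * ||Phi^T f||_1."""
    return 0.5 * np.sum((y - f) ** 2) + lam * np.sum(np.abs(Phi.T @ f))

rng = np.random.default_rng(0)
n = 64
Phi, _ = np.linalg.qr(rng.standard_normal((n, n)))  # orthonormal: dual = Phi

c_true = np.zeros(n)
c_true[[3, 17, 40]] = [5.0, -4.0, 3.0]               # sparse coefficients
y = Phi @ c_true + 0.3 * rng.standard_normal(n)      # noisy measurement (10.11)

lam = 0.9
c_hat = soft_threshold(Phi.T @ y, lam)               # thresholded coefficients
f_hat = Phi @ c_hat                                  # reconstruction (10.13)
active = np.flatnonzero(c_hat)                       # data-dependent active set
```

The index set `active` depends on the particular input `y`, which is the adaptivity referred to above; for a redundant frame the thresholding rule is in general no longer the exact minimizer and an iterative solver would be needed.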

For the last few decades, one of the most widely used frame representations in
signal processing has been the wavelet frame, or framelet [106], whose basis functions
capture multi-resolution, scale- and shift-dependent features. For example,
Fig. 10.1 illustrates the Haar wavelet basis across different scale parameters $j$.
As the scale increases, the support of the basis functions becomes narrower so that
they can capture more localized behavior of the signal after applying the inner product.
More specifically, Fig. 10.2 shows the noiseless original signal $f$ and its noisy version $y$,
and their wavelet expansion coefficients. Here, $d_j(n)$ denotes the wavelet
expansion coefficients at scale $j$. As shown in Fig. 10.2, for the smooth noiseless signal,
most of the wavelet expansion coefficients are zero except for a few expansion coefficients
at lower scales. On the other hand, for the noisy signal, small-magnitude
nonzero wavelet expansion coefficients are found across all scales. Therefore,
the main idea of wavelet shrinkage for signal denoising [107] is to zero out
the small-magnitude wavelet coefficients using a thresholding operation $\rho_\lambda(\cdot)$ while
retaining the large wavelet coefficients above the threshold, which carry important
signal characteristics. Accordingly, reconstruction using (10.13) can recover the
underlying noiseless signals.

Extending this idea beyond signal denoising, other successful tools in
signal processing theory are the compressed sensing or sparse recovery techniques
[46]. In particular, compressed sensing theory is based on the observation that when
images are represented via bases or frames, in many cases they can be represented
as a sparse combination of basis or frame vectors, as shown in Fig. 10.3. Thanks to
the sparse representation, even when the measurements are fewer than
classical limits such as the Nyquist limit, one could obtain a stable solution of the
inverse problem by searching for the sparse representation that generates an output
consistent with the measured data. As a result, the goal of
the image reconstruction problem is to find an optimal set of sparse basis functions
suitable for the given measurement data. This is why the classical method is often
called the basis pursuit [46].
Fig. 10.1 Haar wavelet basis across scales


In contrast to the kernel machine in (10.2), the basis pursuit using the frame
representation has several unique advantages. First, the function space that the basis
pursuit can generate is often larger than the RKHS from (10.2). In fact, this space is
often called the union of subspaces [108], which is a large subset of a Hilbert space.
Second, among the given frames, the choice of active dual frame basis $\tilde\phi_k$ is totally
data-dependent. Therefore, the basis pursuit representation is an adaptive model.
Moreover, the expansion coefficients $\rho_\lambda(\langle y, \phi_k \rangle)$ of the basis pursuit are also totally
dependent on the input $y$, thereby generating more diverse representation than the
kernel machine with fixed expansion coefficients.

Having said this, one of the most fundamental limitations of the basis pursuit
approach in (10.13) is that it is transductive, which does not allow inductive learning
from the training data. In general, the basis pursuit regression in (10.12) should be
solved for each data set, since the nonlinear thresholding function should be found
by an optimization method for each data set. Therefore, it is difficult to transfer the
learning from one data set to another.

Fig. 10.2 Wavelet coefficients for two signals across scales

Fig. 10.3 Reconstruction principle of compressed sensing


10.3 Convolution Framelets


Before we dive into the convolutional neural network, here we briefly review the 
theory of deep convolutional framelets [42], which is a linear frame expansion but 
turns out to be an important stepping stone to understand the geometry of CNN. For 
simplicity, we consider the 1-D version of the theory. 


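10.3.1 Convolution and Hankel Matrix

Let an n-dimensional signal $x \in \mathbb{R}^n$ be represented by

$$x = [x[0] \ \cdots \ x[n-1]]^\top \in \mathbb{R}^n.$$

Then, the following results are standard in signal processing:

• Given two vectors $x, h \in \mathbb{R}^n$, the circular convolution is defined by

$$(x \circledast h)[i] = \sum_{k=0}^{n-1} x[i-k]\,h[k], \tag{10.14}$$

where appropriate periodic boundary conditions are imposed on $x$.

• For any $v \in \mathbb{R}^{n_1}$ and $w \in \mathbb{R}^{n_2}$ with $n_1, n_2 \leq n$, define the convolution in $\mathbb{R}^n$ as

$$v \circledast w = v^0 \circledast w^0,$$

where

$$v^0 = \begin{bmatrix} v^\top & 0_{n-n_1}^\top \end{bmatrix}^\top, \quad w^0 = \begin{bmatrix} w^\top & 0_{n-n_2}^\top \end{bmatrix}^\top.$$

• For any $v \in \mathbb{R}^{n_1}$ with $n_1 \leq n$, define the flip of $v$ as $\bar v[n] = v^0[-n]$, where we
use the periodic boundary condition.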
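Using these notations, a single-input single-output (SISO) circular convolution
of the input $x$ and the filter $\psi \in \mathbb{R}^r$ with $r \leq n$ can be represented by:

$$y[i] = (x \circledast \bar\psi)[i] = \sum_{k=0}^{n-1} x[i-k]\,\bar\psi[k]. \tag{10.15}$$

By defining a Hankel matrix $\mathbb{H}_r^n(x) \in \mathbb{R}^{n \times r}$ as

$$\mathbb{H}_r^n(x) = \begin{bmatrix} x[0] & x[1] & \cdots & x[r-1] \\ x[1] & x[2] & \cdots & x[r] \\ \vdots & \vdots & \ddots & \vdots \\ x[n-1] & x[n] & \cdots & x[r-2] \end{bmatrix}, \tag{10.16}$$

where the indices are understood modulo $n$ owing to the periodic boundary condition,
the convolution in (10.15) can be compactly represented by

$$y = x \circledast \bar\psi = \mathbb{H}_r^n(x)\,\psi. \tag{10.17}$$

Then, we can obtain the following key equality [109], whose proof is repeated here
for educational purposes:

Lemma 10.1 For a given $f \in \mathbb{R}^n$, let $\mathbb{H}_r^n(f) \in \mathbb{R}^{n \times r}$ denote the associated Hankel
matrix. Then, for any vectors $u \in \mathbb{R}^n$ and $v \in \mathbb{R}^r$ with $r \leq n$ and Hankel matrix
$F := \mathbb{H}_r^n(f)$, we have

$$u^\top F v = u^\top (f \circledast \bar v) = f^\top (u \circledast v) = \langle f, u \circledast v \rangle, \tag{10.18}$$

where $\bar v[n] := v[-n]$ denotes the flipped version of the vector $v$.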


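Proof We only need to show the second equality. This can be shown as

$$\begin{aligned}
\langle f, u \circledast v \rangle = f^\top (u \circledast v)
&= \sum_{i=0}^{n-1} f[i] \left( \sum_{k=0}^{n-1} u[k]\, v[i-k] \right) \\
&= \sum_{k=0}^{n-1} u[k] \left( \sum_{i=0}^{n-1} f[i]\, \bar v[k-i] \right) \\
&= \sum_{k=0}^{n-1} u[k]\, (f \circledast \bar v)[k] \\
&= u^\top (f \circledast \bar v).
\end{aligned}$$

This concludes the proof. □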
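Both (10.17) and Lemma 10.1 are easy to verify numerically. The sketch below implements the circular convolution (10.14), the flip, and the wrap-around Hankel matrix (10.16) directly from their definitions; the helper names `circ_conv`, `flip`, and `hankel` and the toy sizes are choices made here for illustration.

```python
import numpy as np

def circ_conv(x, h):
    """Circular convolution (10.14); shorter inputs are zero-padded."""
    n = max(len(x), len(h))
    x0 = np.pad(x, (0, n - len(x)))
    h0 = np.pad(h, (0, n - len(h)))
    return np.array([sum(x0[(i - k) % n] * h0[k] for k in range(n))
                     for i in range(n)])

def flip(v, n):
    """Flip in R^n: v_bar[i] = v0[-i] with periodic indexing."""
    v0 = np.pad(v, (0, n - len(v)))
    return v0[(-np.arange(n)) % n]

def hankel(x, r):
    """Wrap-around Hankel matrix (10.16): entry (i, j) is x[(i + j) mod n]."""
    n = len(x)
    idx = (np.arange(n)[:, None] + np.arange(r)[None, :]) % n
    return x[idx]

rng = np.random.default_rng(0)
n, r = 8, 3
f, u = rng.standard_normal(n), rng.standard_normal(n)
v = rng.standard_normal(r)

F = hankel(f, r)
lhs = u @ F @ v                              # u^T F v
mid = u @ circ_conv(f, flip(v, n))           # u^T (f * v_bar)
rhs = f @ circ_conv(u, v)                    # f^T (u * v) = <f, u * v>
```

The three scalars `lhs`, `mid`, and `rhs` agree up to floating-point error, and `F @ v` reproduces the convolution with the flipped filter as in (10.17).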


10.3.2 Convolution Framelet Expansion 


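Lemma 10.1 provides an important clue for the convolution framelet expansion.
Specifically, for a given signal $f \in \mathbb{R}^n$, consider the following two sets of matrices,
$\Phi, \tilde\Phi \in \mathbb{R}^{n \times n}$ and $\Psi, \tilde\Psi \in \mathbb{R}^{r \times r}$, such that they satisfy the following frame
condition [42]:

$$\tilde\Phi \Phi^\top = I_n, \quad \Psi \tilde\Psi^\top = I_r. \tag{10.19}$$

Then, we have the following trivial equality:

$$\mathbb{H}_r^n(f) = \tilde\Phi \Phi^\top \mathbb{H}_r^n(f) \Psi \tilde\Psi^\top = \tilde\Phi C \tilde\Psi^\top, \tag{10.20}$$

where

$$C = \Phi^\top \mathbb{H}_r^n(f) \Psi \in \mathbb{R}^{n \times r}, \tag{10.21}$$

whose $(i, j)$-th element is given by

$$c_{ij} = \phi_i^\top \mathbb{H}_r^n(f) \psi_j = \langle f, \phi_i \circledast \psi_j \rangle, \tag{10.22}$$

where $\phi_i$ and $\psi_j$ denote the $i$-th and the $j$-th column vector of $\Phi$ and $\Psi$,
respectively, and the last equality of (10.22) comes from Lemma 10.1.

Now, we define an inverse Hankel operator $\mathbb{H}_r^{n(-)}: \mathbb{R}^{n \times r} \mapsto \mathbb{R}^n$ such that for any
$f \in \mathbb{R}^n$, the following equality is satisfied:

$$f = \mathbb{H}_r^{n(-)}\left(\mathbb{H}_r^n(f)\right). \tag{10.23}$$

Then, the following key equality can be obtained [42]:

$$\mathbb{H}_r^{n(-)}\left(\tilde\Phi C \tilde\Psi^\top\right) = \frac{1}{r} \sum_{j=1}^{r} (\tilde\Phi c_j) \circledast \tilde\psi_j \tag{10.24}$$

$$= \frac{1}{r} \sum_{i,j} c_{ij} \left(\tilde\phi_i \circledast \tilde\psi_j\right), \tag{10.25}$$

where $c_j$ denotes the $j$-th column of $C$.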
i,j 
By combining (10.25) with (10.20) and (10.22), we have

$$f = \frac{1}{r} \sum_{i,j} \langle f, \phi_i \circledast \psi_j \rangle \left(\tilde\phi_i \circledast \tilde\psi_j\right). \tag{10.26}$$

This implies that $\{\phi_i \circledast \psi_j\}_{i,j}$ constitutes a frame for $\mathbb{R}^n$ and $\{\tilde\phi_i \circledast \tilde\psi_j\}_{i,j}$
corresponds to its dual frame. Furthermore, for many interesting signals $f$ in real
applications, the Hankel matrix $\mathbb{H}_r^n(f)$ has low-rank structures [110-112], which
makes the expansion coefficients $c_{ij}$ nonzero only at small index sets. Therefore,
the convolution framelet expansion is a concise signal representation similar to the
wavelet frames [42, 109].

In the convolution framelet, the functions $\phi_i, \tilde\phi_i$ correspond to the global basis,
whereas $\psi_j, \tilde\psi_j$ are local basis functions. Therefore, by the convolution between
the global and local basis to generate a new frame basis, convolution framelets can
exploit both local and global structures of signals [42, 109], which is an important
advance in signal representation theory.
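The expansion can be checked in its Hankel-matrix form with a few lines of NumPy. The sketch below assumes orthonormal matrices `Phi` and `Psi`, so that the duals coincide with them and (10.19) holds trivially, and it implements the inverse Hankel operator as an average over the wrap-around anti-diagonals, which is one operator satisfying (10.23). The sizes and the sinusoidal test signal are illustrative assumptions.

```python
import numpy as np

def hankel(x, r):
    """Wrap-around Hankel matrix: entry (i, j) is x[(i + j) mod n]."""
    n = len(x)
    idx = (np.arange(n)[:, None] + np.arange(r)[None, :]) % n
    return x[idx]

def inverse_hankel(M):
    """Average the r entries of M that hold each sample f[k], cf. (10.23)."""
    n, r = M.shape
    f = np.zeros(n)
    for i in range(n):
        for j in range(r):
            f[(i + j) % n] += M[i, j]
    return f / r

rng = np.random.default_rng(0)
n, r = 8, 3
f = rng.standard_normal(n)

# Orthonormal global and local bases (assumption): duals equal the bases.
Phi, _ = np.linalg.qr(rng.standard_normal((n, n)))
Psi, _ = np.linalg.qr(rng.standard_normal((r, r)))
Phi_dual, Psi_dual = Phi, Psi

C = Phi.T @ hankel(f, r) @ Psi                      # framelet coefficients (10.21)
f_rec = inverse_hankel(Phi_dual @ C @ Psi_dual.T)   # (10.20) followed by (10.23)

# A sampled sinusoid gives a low-rank Hankel matrix.
s = np.cos(2 * np.pi * np.arange(n) / n)
rank_s = np.linalg.matrix_rank(hankel(s, r))
```

Here `f_rec` recovers `f`, and the Hankel matrix of the single sinusoid `s` has rank 2 although it has three columns, a small instance of the low-rank structure mentioned above.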


10.3.3 Link to CNN 


Although the convolution framelet is a linear representation, the reason we care 
about it so much is that it reveals the role of the pooling and convolution filters in 
CNNs. More specifically, using (10.17), we can show that the convolution framelet 
coefficient matrix C in (10.21) can be represented by 


$$C = [c_1 \ \cdots \ c_r] = \Phi^\top \mathbb{H}_r^n(f) \Psi = \Phi^\top (f \circledast \bar\Psi), \tag{10.27}$$

where

$$f \circledast \bar\Psi := [f \circledast \bar\psi_1 \ \cdots \ f \circledast \bar\psi_r], \tag{10.28}$$

which corresponds to the single-input multi-output (SIMO) convolution. Note that
the convolution operation is local since the filter weights are multiplied with the
pixels within the receptive field. After the convolution operation, $\Phi^\top$ is multiplied
with all elements of the filtered output, which corresponds to the global operation.

On the other hand, by combining (10.24) with (10.20), we have

$$f = \frac{1}{r} \sum_{j=1}^{r} (\tilde\Phi c_j) \circledast \tilde\psi_j, \tag{10.29}$$

which shows the processing step of the framelet coefficient $C$ at the decoder.
More specifically, we apply the global operation $\tilde\Phi$ to $c_j$ first, after which the
multi-input single-output (MISO) convolution operation is performed to obtain the final
reconstruction.

Fig. 10.4 Single-resolution encoder-decoder networks. (a) single-level convolutional framelet
decomposition with identity pooling. (b) multi-level convolutional framelet deconvolution with
identity pooling

Fig. 10.5 Multi-resolution encoder-decoder networks

In fact, the order of these signal processing operations is very similar to the
two-layer encoder-decoder architecture, as shown in Figs. 10.4 and 10.5. At the
encoder side, the SIMO convolution operation is performed first to generate multi-channel
feature maps, after which the global pooling operation is performed. At the
decoder side, the feature map is unpooled first, after which the MISO convolution
is performed. Therefore, we can easily see the important analogy: the convolution
framelet coefficients are similar to the feature maps in CNNs, and $\Phi$, $\tilde\Phi$ work as
pooling and unpooling layers, respectively, whereas $\Psi$, $\tilde\Psi$ correspond to the encoder
and decoder filters, respectively. This implies that the pooling operation defines the
global basis, whereas the convolution filters determine the local basis, and the CNN
tries to exploit both global and local structure of the signal.
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Fig. 10.6 Encoder—decoder CNNs without skip connection 


Furthermore, by simply changing the global basis, we can obtain various network 
architectures. For example, in Fig. 10.4, we use ® = ® = J,,, whereas we use the 
Haar wavevelet transform as global pooling for the case of Fig. 10.5. 
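
The encoder step in (10.27) and the decoder step in (10.29) can be checked numerically. The following is a minimal NumPy sketch under stated assumptions, not the reference implementation: the helper names (`hankel`, `hankel_average`, `random_orthonormal`) are hypothetical, the local basis is assumed square ($q = r$), and both bases are taken orthonormal so that each dual basis equals the basis itself. The encoder is written through the wrap-around Hankel matrix of $f$ rather than through an explicit convolution routine.

```python
import numpy as np

def hankel(f, r):
    """Wrap-around Hankel matrix H(f) of size n x r with H[i, k] = f[(i + k) mod n]."""
    n = len(f)
    idx = (np.arange(n)[:, None] + np.arange(r)[None, :]) % n
    return f[idx]

def hankel_average(H):
    """Map an n x r wrap-around Hankel matrix back to a signal by averaging
    the r copies of every sample (this is where the 1/r factor comes from)."""
    n, r = H.shape
    f = np.zeros(n)
    for k in range(r):
        # np.roll(col, k)[m] = col[(m - k) mod n] = f[m] for a Hankel column
        f += np.roll(H[:, k], k)
    return f / r

def random_orthonormal(m, rng):
    """Orthonormal basis from a QR factorization (assumption: dual = basis)."""
    q, _ = np.linalg.qr(rng.standard_normal((m, m)))
    return q

rng = np.random.default_rng(0)
n, r = 8, 3
f = rng.standard_normal(n)
Psi = random_orthonormal(r, rng)              # local basis (filters)

global_bases = {
    "identity pooling": np.eye(n),            # Phi = I_n
    "orthonormal pooling": random_orthonormal(n, rng),
}

errors = {}
for name, Phi in global_bases.items():
    C = Phi.T @ hankel(f, r) @ Psi            # encoder: filtering, then global pooling
    f_rec = hankel_average(Phi @ C @ Psi.T)   # decoder: unpooling, then filtering
    errors[name] = float(np.max(np.abs(f - f_rec)))
    print(f"{name}: max reconstruction error = {errors[name]:.2e}")
```

Swapping the entry of `global_bases` changes only the global basis while the local filters stay fixed, which mirrors how different pooling choices lead to different architectures.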


10.3.4 Deep Convolutional Framelets 


Now, we are ready to explain the multilayer convolution framelets, which we call
deep convolutional framelets [42]. For simplicity, we consider encoder-decoder
networks without skip connections, as shown in Fig. 10.6, although the analysis
can be applied equally well when the skip connections are present. Furthermore, we
assume a symmetric configuration so that both the encoder and the decoder have the same
number of layers, say $\kappa$; the input and output dimensions for the encoder layer $E^l$
and the decoder layer $D^l$ are symmetric:

$$E^l : \mathbb{R}^{d_{l-1}} \to \mathbb{R}^{d_l}, \quad D^l : \mathbb{R}^{d_l} \to \mathbb{R}^{d_{l-1}}, \quad l \in [\kappa], \tag{10.30}$$

where $[n]$ denotes the set $\{1, \cdots, n\}$. At the $l$-th layer, $m_l$ and $q_l$ denote the
dimension of the signal and the number of filter channels, respectively, so that
$d_l := m_l q_l$. The length of the filter is assumed to be $r$.

We now define the $l$-th layer input signal for the encoder layer from $q_{l-1}$ input
channels,

$$z^{l-1} := \begin{bmatrix} z_1^{l-1\top} & \cdots & z_{q_{l-1}}^{l-1\top} \end{bmatrix}^\top \in \mathbb{R}^{d_{l-1}}, \tag{10.31}$$

where $^\top$ denotes the transpose, and $z_j^{l-1} \in \mathbb{R}^{m_{l-1}}$ refers to the $j$-th channel input
with the dimension $m_{l-1}$. The $l$-th layer output signal $z^l$ is similarly defined. Note
that the filtered output is now stacked as a single column vector in (10.31), which
is different from the former treatment of the convolution framelet, where the filter
output for each channel is stacked as an additional column. It turns out that the
notation in (10.31) makes the mathematical derivation for multilayer convolutional
neural networks much more tractable than the former notation, although the roles of
the global and local bases are more clearly seen in the former notation.

Then, for the linear encoder-decoder CNN without skip connections, as shown
in Fig. 10.6a, we have the following linear representation at the $l$-th encoder layer
[35]:


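$$z^l = E^{l\top} z^{l-1}, \tag{10.32}$$

where

$$E^l = \begin{bmatrix} \Phi^l \circledast \psi^l_{1,1} & \cdots & \Phi^l \circledast \psi^l_{q_l,1} \\ \vdots & \ddots & \vdots \\ \Phi^l \circledast \psi^l_{1,q_{l-1}} & \cdots & \Phi^l \circledast \psi^l_{q_l,q_{l-1}} \end{bmatrix}, \tag{10.33}$$

where $\Phi^l$ denotes the $m_l \times m_l$ matrix that represents the pooling operation at the
$l$-th layer, $\psi^l_{i,j} \in \mathbb{R}^r$ represents the $l$-th layer encoder filter to generate the
$i$-th channel output from the contribution of the $j$-th channel input, and $\Phi^l \circledast \psi^l_{i,j}$
represents a single-input multi-output (SIMO) convolution [35]:

$$\Phi^l \circledast \psi^l_{i,j} = \begin{bmatrix} \phi^l_1 \circledast \psi^l_{i,j} & \cdots & \phi^l_{m_l} \circledast \psi^l_{i,j} \end{bmatrix}. \tag{10.34}$$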


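Note that the bias can be readily included by adding rows to $E^l$ for the bias and
augmenting $z^{l-1}$ with a last element equal to 1.
Similarly, the $l$-th decoder layer can be represented by

$$\tilde{z}^{l-1} = D^l \tilde{z}^l, \tag{10.35}$$

where

$$D^l = \begin{bmatrix} \tilde{\Phi}^l \circledast \tilde{\psi}^l_{1,1} & \cdots & \tilde{\Phi}^l \circledast \tilde{\psi}^l_{1,q_l} \\ \vdots & \ddots & \vdots \\ \tilde{\Phi}^l \circledast \tilde{\psi}^l_{q_{l-1},1} & \cdots & \tilde{\Phi}^l \circledast \tilde{\psi}^l_{q_{l-1},q_l} \end{bmatrix}, \tag{10.36}$$

where $\tilde{\Phi}^l$ denotes the $m_l \times m_l$ matrix that represents the unpooling operation at the
$l$-th layer, and $\tilde{\psi}^l_{i,j} \in \mathbb{R}^r$ represents the $l$-th layer decoder filter to generate the $i$-th
channel output from the contribution of the $j$-th channel input.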

Then, the output $v$ of the encoder-decoder CNN with respect to the input $z$ can be
written as follows [35]:

$$v = \mathcal{T}_\Theta(z) = \sum_i \langle b_i, z \rangle \tilde{b}_i, \tag{10.37}$$

where $\Theta$ refers to all encoder and decoder convolution filters, and $b_i$ and $\tilde{b}_i$ denote
the $i$-th columns of the following matrices, respectively:

$$B = E^1 E^2 \cdots E^\kappa, \quad \tilde{B} = D^1 D^2 \cdots D^\kappa. \tag{10.38}$$

Note that this representation is completely linear, since the representation does
not vary once the network parameters $\Theta$ are trained. Furthermore, consider the
following multilayer frame conditions for the pooling and filter layers:


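$$\tilde{\Phi}^l \Phi^{l\top} = \alpha I_{m_{l-1}}, \quad \Psi^l \tilde{\Psi}^{l\top} = \frac{1}{r\alpha} I_{r q_{l-1}}, \quad \forall l, \tag{10.39}$$

where $I_n$ denotes the $n \times n$ identity matrix, $\alpha > 0$ is a nonzero constant, and

$$\Psi^l = \begin{bmatrix} \psi^l_{1,1} & \cdots & \psi^l_{q_l,1} \\ \vdots & \ddots & \vdots \\ \psi^l_{1,q_{l-1}} & \cdots & \psi^l_{q_l,q_{l-1}} \end{bmatrix}, \tag{10.40}$$

$$\tilde{\Psi}^l = \begin{bmatrix} \tilde{\psi}^l_{1,1} & \cdots & \tilde{\psi}^l_{1,q_l} \\ \vdots & \ddots & \vdots \\ \tilde{\psi}^l_{q_{l-1},1} & \cdots & \tilde{\psi}^l_{q_{l-1},q_l} \end{bmatrix}. \tag{10.41}$$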

Under these frame conditions, we showed in [35] that (10.37) satisfies the perfect
reconstruction condition, i.e.,

$$z = \mathcal{T}_\Theta(z) = \sum_i \langle b_i, z \rangle \tilde{b}_i, \tag{10.42}$$

hence the corresponding deep convolutional framelet is indeed a frame representation,
similar to wavelet frames [113].

In the deep convolutional framelets, all the encoder and decoder filters can
be estimated from the training data set; hence, it is a data-driven model. More
specifically, for the given training data $\{(x_i, y_i)\}_{i=1}^N$, the CNN parameter $\Theta$ is
estimated by solving the following optimization problem:

$$\min_\Theta \sum_{i=1}^N \ell\left(y_i, \mathcal{T}_\Theta(x_i)\right) + \lambda R(\Theta). \tag{10.43}$$


Once the parameter $\Theta$ is learned, the encoder and decoder matrices $E^l$ and $D^l$ are
determined. Therefore, the representations are entirely data-driven and dependent
on the filter sets that are learned from the training data set, which is different from
the classical kernel machine or basis pursuit approaches, where the underlying kernels
or frames are specified in a top-down manner.

That said, the deep convolutional framelet does not yet meet the desiderata of
machine learning, since once it is trained, the frame representation does not vary,
hence the data-driven adaptation is not possible. In the next section, we will show
that the last missing element is a nonlinearity such as the ReLU, which plays a key role
in machine learning.


10.4 Geometry of CNN 


10.4.1 Role of Nonlinearity 


In fact, the analysis of deep convolutional framelets with the ReLU nonlinearities 
turns out to be a simple modification, but it provides very fundamental insights on 
the geometry of the deep neural network. 

Specifically, in [35] we showed that even with ReLU nonlinearities the expression
(10.37) is still valid. The only change is that the basis matrices have additional
ReLU pattern blocks in between the encoder, decoder, and skipped blocks. For example,
the expression in (10.38) is changed as follows:

$$B(z) = E^1 \Lambda^1(z) E^2 \Lambda^2(z) \cdots \Lambda^{\kappa-1}(z) E^\kappa, \tag{10.44}$$

$$\tilde{B}(z) = D^1 \tilde{\Lambda}^1(z) D^2 \tilde{\Lambda}^2(z) \cdots \tilde{\Lambda}^{\kappa-1}(z) D^\kappa, \tag{10.45}$$

where $\Lambda^l(z)$ and $\tilde{\Lambda}^l(z)$ are diagonal matrices with 0 and 1 elements indicating
the ReLU activation patterns.

Accordingly, the linear representation in (10.37) should be modified as a
nonlinear representation:

$$v = \mathcal{T}_\Theta(z) = \sum_i \langle b_i(z), z \rangle \tilde{b}_i(z), \tag{10.46}$$

where we now have an explicit dependency on $z$ for $b_i(z)$ and $\tilde{b}_i(z)$ due to the
input-dependent ReLU activation patterns, which makes the representation nonlinear.

Again, the filter parameter $\Theta$ is estimated by solving the optimization problem in
(10.43), with $\mathcal{T}_\Theta$ now given by the nonlinear representation in (10.46). Therefore,
the representations are entirely data-driven.


10.4.2 Nonlinearity Is the Key for Inductive Learning 


In (10.44) and (10.45), the encoder and decoder basis matrices have an explicit
dependence on the ReLU activation pattern, which in turn depends on the input. Here we
will show that this ReLU-activation-dependent diagonal matrix plays a key role in enabling
inductive learning.

Fig. 10.7 Reconstruction principle of deep learning

Specifically, the nonlinearity is applied after the convolution operation, so the on- 
and-off activation pattern of each ReLU determines a binary partition of the feature 
space at each layer across the hyperplane that is determined by the convolution. 
Accordingly, in deep neural networks, the input space is partitioned into multiple 
non-overlapping regions so that input images for each region share the same linear 
representation, but not across the partition. This implies that two input
images that fall into different regions are automatically switched to two distinct
linear representations, as shown in Fig. 10.7.
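
The switching between linear representations can be illustrated on a toy network. The following is a minimal sketch, assuming a hypothetical two-layer fully connected ReLU network without bias (the weights `W1`, `W2` and the helper names are illustrative and not taken from the text). It builds the 0/1 activation pattern for an input and the corresponding input-dependent matrix, and compares inputs that share a pattern with inputs that do not.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((6, 2))   # hypothetical first-layer weights
W2 = rng.standard_normal((3, 6))   # hypothetical second-layer weights

def forward(x):
    """Two-layer ReLU network without bias."""
    return W2 @ np.maximum(W1 @ x, 0.0)

def pattern(x):
    """Diagonal of the 0/1 ReLU activation pattern matrix for input x."""
    return (W1 @ x > 0).astype(float)

def local_linear_map(x):
    """Matrix A(x) = W2 diag(pattern(x)) W1, so that forward(x) = A(x) @ x."""
    return W2 @ np.diag(pattern(x)) @ W1

x = np.array([1.0, 0.5])
x_same = 2.0 * x      # without bias, scaling keeps the activation pattern
x_other = -x          # negation flips the sign of every pre-activation

A = local_linear_map(x)
print("network output equals A(x) x:", np.allclose(forward(x), A @ x))
print("same pattern, same map:",
      np.array_equal(pattern(x), pattern(x_same)),
      np.allclose(A, local_linear_map(x_same)))
print("different pattern:", not np.array_equal(pattern(x), pattern(x_other)))
```

No optimization problem is solved for a new input here: evaluating the network only selects which rows of `W1` are kept, which is the sense in which the representation is switched rather than recomputed.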

This leads to an important insight: although the CNN approach and the basis 
pursuit in Fig. 10.3 appear to be two completely different approaches, there exists a 
very close relationship between the two. Specifically, the CNN is indeed similar to 
the classical basis pursuit algorithm that searches for the distinct linear representa- 
tion for each input, but in contrast to the basis pursuit, the CNN is inductive since 
it does not solve the optimization problem for a new input, rather it only switches 
to different frame representations by changing the ReLU activation patterns. This 
inductivity from the learned filter coefficients is an important advance over the 
classical signal processing approach. 


10.4.3 Expressivity 


Given the partition-dependent framelet geometry of CNN, we can easily expect that
with a greater number of input space partitions, the nonlinear function approximation
by the piecewise linear frame representation becomes more accurate. Therefore,
the number of piecewise linear regions is directly related to the expressivity or
representation power of the neural network. If each ReLU activation pattern is
independent of the others, then the number of distinct ReLU activation patterns
is $2^{\#\text{ of neurons}}$, where the number of neurons is determined by the total number of
features. Therefore, the number of distinct linear representations increases
exponentially with the depth, width, and skip connections, as shown in Fig. 10.8 [35].
This again confirms the expressive power of CNNs thanks to the ReLU nonlinearities.

Fig. 10.8 Expressivity increases exponentially with channels, depth, and skip connections
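
The growth of the number of activation patterns can be explored empirically. The sketch below is an illustration under stated assumptions, not a result from the text: it uses hypothetical random fully connected ReLU layers on a 2-D input, samples the input on a finite grid, and counts the distinct on/off patterns that actually occur. The count on a grid is only a lower estimate of the true number of regions, and $2^{\#\text{ of neurons}}$ is an upper bound that assumes independent neurons, which does not hold for a low-dimensional input.

```python
import numpy as np

def count_patterns(widths, n_grid=200, seed=0):
    """Count distinct ReLU activation patterns of a random MLP on a 2-D grid.

    widths: list with the number of neurons in each hidden layer (hypothetical).
    """
    rng = np.random.default_rng(seed)
    g = np.linspace(-3.0, 3.0, n_grid)
    h = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)  # grid of inputs
    d_in, bits = 2, []
    for d_out in widths:
        W = rng.standard_normal((d_in, d_out))
        b = rng.standard_normal(d_out)
        pre = h @ W + b
        bits.append(pre > 0)            # on/off state of every neuron
        h = np.maximum(pre, 0.0)        # ReLU
        d_in = d_out
    codes = np.concatenate(bits, axis=1).astype(np.uint8)
    return len(np.unique(codes, axis=0))

for widths in ([4], [8], [4, 4], [8, 8], [8, 8, 8]):
    n_neurons = sum(widths)
    print(f"widths={widths}: {count_patterns(widths)} patterns observed "
          f"(upper bound 2^{n_neurons} = {2 ** n_neurons})")
```

Comparing the printed counts across configurations gives a rough feel for how width and depth affect the number of linear regions on this particular grid.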


10.4.4 Geometric Meaning of Features 


One of the interesting questions in neural networks is understanding the meaning
of the intermediate features that are obtained as the output of each layer of a neural
network. Although these are largely regarded as latent variables, to the best of our
knowledge the geometric understanding of each latent variable is still not complete.
In this section, we show that this intermediate feature is directly related to the
relative coordinates with respect to the hyperplanes that partition the product space
of the previous layer features.

To understand the claim, let us first revisit the ReLU operation for each neuron at
the encoder layer. Let $E^l_i$ denote the $i$-th column of the encoder matrix $E^l$ and $z^l_i$ be the
$i$-th element of $z^l$. Then, the output of an activated neuron can be represented as:

$$z^l_i = \sigma\left(E^{l\top}_i z^{l-1}\right) = \underbrace{\frac{\left|\langle E^l_i, z^{l-1} \rangle\right|}{\|E^l_i\|}}_{\text{distance to the hyperplane}} \times \|E^l_i\|, \tag{10.47}$$

where the normal vector of the hyperplane can be identified as

$$n^l = E^l_i. \tag{10.48}$$


This implies that the output of the activated neuron is the scaled version of the
distance to the hyperplane which partitions the space of the feature vector $z^{l-1}$ into
active and non-active regions. Therefore, the role of the neural network can be
understood as representing the input data with a coordinate vector using the relative
distances with respect to multiple hyperplanes.

In fact, the aforementioned interpretation of the feature may not be novel, since 
a similar interpretation can be used to explain the geometrical meaning of the 
linear frame coefficients. Instead, one of the most important differences comes from 
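the multilayer representation. To understand this, consider the following two-layer
neural network:

$$z^l_i = \sigma\left(E^{l\top}_i z^{l-1}\right), \tag{10.49}$$

where

$$z^{l-1} = \sigma\left(E^{(l-1)\top} z^{l-2}\right) = \Lambda(z^{l-1}) E^{(l-1)\top} z^{l-2}, \tag{10.50}$$

where $\Lambda(z^{l-1})$ again encodes the ReLU activation pattern. Using the property of
the inner product and adjoint operator, we have

$$\begin{aligned} z^l_i &= \sigma\left(E^{l\top}_i z^{l-1}\right) \\ &= \sigma\left(E^{l\top}_i \Lambda(z^{l-1}) E^{(l-1)\top} z^{l-2}\right) \\ &= \sigma\left(\left\langle \Lambda(z^{l-1}) E^l_i,\; E^{(l-1)\top} z^{l-2} \right\rangle\right). \end{aligned} \tag{10.51}$$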


This indicates that on the space of the unconstrained feature vector from the
previous layer (i.e., no ReLU is assumed), the hyperplane normal vector is now
changed to

$$n^l = \Lambda(z^{l-1}) E^l_i. \tag{10.52}$$

This implies that the hyperplane in the current layer is adaptively changed with
respect to the input data, since the ReLU activation pattern in the previous layer,
i.e., $\Lambda(z^{l-1})$, can vary depending on the inputs. This is an important difference from
the linear multilayer frame representation, whose hyperplane structure is the same
regardless of the input.

Fig. 10.9 Two-layer neural network with two neurons for each layer. Blue arrows indicate the
normal direction of the hyperplane. The black lines are hyperplanes for the first layers, and the red
lines correspond to the second layer hyperplanes

For example, Fig. 10.9 shows the partition geometry of $\mathbb{R}^2$ by a two-layer neural
network with two neurons at each layer. The normal vector directions for the second
layer hyperplanes are determined by the ReLU activation patterns such that the 
coordinate values at the inactive neuron become degenerate. More specifically, for 
the (A) quadrant where two neurons at the first layers are active, we can obtain two 
hyperplanes in any normal direction determined by the filter coefficients. However, 
for the (B) quadrant where the second neuron is inactive, the situation is different. 
Specifically, due to (10.52), the second coordinate of the normal vector, which 
corresponds to the inactive neuron, becomes degenerate. This leads to the two 
parallel hyperplanes that are distinct only by the bias term. A similar phenomenon 
occurs for the quadrant (C) where the first neuron is inactive. For the (D) quadrant 
where two neurons are inactive, the normal vector becomes zero and there exists no 
partitioning. Therefore, we can conclude that the hyperplane geometry is adaptively 
determined by the feature vectors in the previous layer. 
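
The quadrant-by-quadrant behavior follows directly from (10.52) and can be tabulated. The sketch below is a small illustration with hypothetical second-layer weights `E2` (not the values behind Fig. 10.9): for each first-layer activation pattern it forms the normal vectors as the activation pattern matrix times each column of `E2`, expressed in the space of the first-layer pre-activation features.

```python
import numpy as np

# Hypothetical second-layer matrix: column i plays the role of E_i
# for the i-th second-layer neuron.
E2 = np.array([[1.0, -2.0],
               [3.0,  1.0]])

# First-layer activation patterns for the four quadrants of the text.
quadrants = {
    "A": np.array([1.0, 1.0]),   # both first-layer neurons active
    "B": np.array([1.0, 0.0]),   # second neuron inactive
    "C": np.array([0.0, 1.0]),   # first neuron inactive
    "D": np.array([0.0, 0.0]),   # both inactive
}

normals = {}
for name, active in quadrants.items():
    Lam = np.diag(active)            # 0/1 activation pattern matrix
    normals[name] = Lam @ E2         # column i is the normal of neuron i
    n1, n2 = normals[name][:, 0], normals[name][:, 1]
    parallel = np.isclose(n1[0] * n2[1] - n1[1] * n2[0], 0.0)
    print(f"quadrant {name}: n1={n1}, n2={n2}, parallel or zero: {parallel}")
```

In quadrants B and C one coordinate of both normals is zeroed, so the two second-layer hyperplanes can differ only through their bias terms, and in quadrant D both normals vanish.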

In the following, we provide several toy examples in which the partition geometry
can be easily calculated.

Fig. 10.10 An example two-layer neural network


Problem 10.1 (Partition Geometry of Two-Layer Neural Network in $\mathbb{R}^2$) Consider
a two-layer fully connected network $f_\Theta : \mathbb{R}^2 \to \mathbb{R}^2$ with ReLU nonlinearity, as
shown in Fig. 10.10.

(a) Suppose the weight matrices and biases are given by

$$W^{(0)} = \begin{bmatrix} 2 & -1 \\ 1 & 1 \end{bmatrix}, \quad b^{(0)} = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \quad W^{(1)} = \begin{bmatrix} 1 & 2 \\ -1 & 1 \end{bmatrix}, \quad b^{(1)} = \begin{bmatrix} -9 \\ -2 \end{bmatrix}.$$

    Draw the corresponding input space partition, and compute the output mapping
    with respect to an input vector $(x, y)$ in each input partition. Please derive all
    the steps explicitly.

(b) In problem (a), suppose that the bias terms are zero. Compute the input space
    partition and the output mapping. What do you observe compared to the one
    with bias?

(c) In problem (a), suppose that the second layer weight and bias are changed to

$$W^{(1)} = \begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix}, \quad b^{(1)} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}.$$

    Draw the corresponding input space partition, and compute the output mapping
    with respect to an input vector $(x, y)$ in each input partition. Compared to the
    original problem in (a), what do you observe?


Solution 10.1


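(a) Let $\mathbf{x} = [x, y]^\top \in \mathbb{R}^2$. At the first layer, the output signal is given by

$$o^{(1)} = \sigma\left(W^{(0)} \mathbf{x} + b^{(0)}\right) = \begin{bmatrix} \sigma(2x - y + 1) \\ \sigma(x + y - 1) \end{bmatrix},$$

    where $\sigma$ is the ReLU. Now, at the second layer, we need to consider all cases
    where each ReLU is active or inactive.

    (i) If $2x - y + 1 < 0$ and $x + y - 1 < 0$, then $o^{(1)} = [0, 0]^\top$ and
        $o^{(2)} = \sigma\left(W^{(1)} o^{(1)} + b^{(1)}\right) = \sigma\left([-9, -2]^\top\right) = [0, 0]^\top$.

    (ii) If $2x - y + 1 > 0$ and $x + y - 1 < 0$, then $o^{(1)} = [2x - y + 1, 0]^\top$.
        Hence, $o^{(2)} = \sigma\left(W^{(1)} o^{(1)} + b^{(1)}\right) = \sigma\left([2x - y - 8, -2x + y - 3]^\top\right)$.
        Since $-2x + y - 3 = -(2x - y + 1) - 2 < 0$ in this region, we have

$$o^{(2)} = \begin{cases} [0, 0]^\top, & 2x - y - 8 < 0, \\ [2x - y - 8, 0]^\top, & \text{otherwise}. \end{cases}$$

    (iii) If $2x - y + 1 < 0$ and $x + y - 1 > 0$, then $o^{(1)} = [0, x + y - 1]^\top$ and
        $o^{(2)} = \sigma\left(W^{(1)} o^{(1)} + b^{(1)}\right) = \sigma\left([2x + 2y - 11, x + y - 3]^\top\right)$. Therefore,

$$o^{(2)} = \begin{cases} [0, 0]^\top, & x + y - 3 < 0, \\ [0, x + y - 3]^\top, & 2x + 2y - 11 < 0,\ x + y - 3 \geq 0, \\ [2x + 2y - 11, x + y - 3]^\top, & \text{otherwise}. \end{cases}$$

    (iv) If $2x - y + 1 > 0$ and $x + y - 1 > 0$, then $o^{(1)} = [2x - y + 1, x + y - 1]^\top$ and
        $o^{(2)} = \sigma\left(W^{(1)} o^{(1)} + b^{(1)}\right) = \sigma\left([4x + y - 10, -x + 2y - 4]^\top\right)$. Therefore,

$$o^{(2)} = \begin{cases} [0, 0]^\top, & 4x + y - 10 < 0,\ -x + 2y - 4 < 0, \\ [4x + y - 10, 0]^\top, & 4x + y - 10 \geq 0,\ -x + 2y - 4 < 0, \\ [0, -x + 2y - 4]^\top, & 4x + y - 10 < 0,\ -x + 2y - 4 \geq 0, \\ [4x + y - 10, -x + 2y - 4]^\top, & \text{otherwise}. \end{cases}$$

    The resulting input space partition is shown in Fig. 10.11, where the
    corresponding linear mapping and its rank are illustrated. Note that around
    the two full-rank partitions, there exist rank-1 mapping partitions, which
    join with the rank-0 mapping partition.

Fig. 10.11 Input space partitioning for the problem (a) case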


(b) At the first layer, the output signal is given by

$$o^{(1)} = \sigma\left(W^{(0)} \mathbf{x} + b^{(0)}\right) = \begin{bmatrix} \sigma(2x - y) \\ \sigma(x + y) \end{bmatrix},$$

    where $\sigma$ is the ReLU. At the second layer, we again consider all cases where
    each ReLU is active or inactive.

    (i) If $2x - y < 0$ and $x + y < 0$, then $o^{(1)} = [0, 0]^\top$ and
        $o^{(2)} = \sigma\left(W^{(1)} o^{(1)}\right) = [0, 0]^\top$.

    (ii) If $2x - y > 0$ and $x + y < 0$, then $o^{(1)} = [2x - y, 0]^\top$. Hence,
        $o^{(2)} = \sigma\left(W^{(1)} o^{(1)}\right) = \sigma\left([2x - y, -2x + y]^\top\right) = [2x - y, 0]^\top$.

    (iii) If $2x - y < 0$ and $x + y > 0$, then $o^{(1)} = [0, x + y]^\top$ and
        $o^{(2)} = \sigma\left(W^{(1)} o^{(1)}\right) = \sigma\left([2x + 2y, x + y]^\top\right) = [2x + 2y, x + y]^\top$.

    (iv) If $2x - y > 0$ and $x + y > 0$, then $o^{(1)} = [2x - y, x + y]^\top$ and
        $o^{(2)} = \sigma\left(W^{(1)} o^{(1)}\right) = \sigma\left([4x + y, -x + 2y]^\top\right)$. Therefore,

$$o^{(2)} = \begin{cases} [4x + y, 0]^\top, & -x + 2y < 0, \\ [4x + y, -x + 2y]^\top, & \text{otherwise}. \end{cases}$$

    The resulting input space partition is shown in Fig. 10.12, where the corresponding
    linear mapping and its rank are illustrated. Similar to problem (a),
    around the two full-rank partitions, there exist rank-1 mapping partitions, which
    join with the rank-0 mapping partition. Since there is no bias term, all the
    hyperplanes must contain the origin. Also, no two hyperplanes share the same
    normal vector, since parallel hyperplanes cannot be formed without bias terms.
    As a result, the input space partition becomes simpler compared to (a).

Fig. 10.12 Input space partitioning for the problem (b) case

(c) At the first layer, the output signal is given by

$$o^{(1)} = \sigma\left(W^{(0)} \mathbf{x} + b^{(0)}\right) = \begin{bmatrix} \sigma(2x - y + 1) \\ \sigma(x + y - 1) \end{bmatrix},$$

    where $\sigma$ is the ReLU. Now, at the second layer, we need to consider all cases
    where each ReLU is active or inactive.

    (i) If $2x - y + 1 < 0$ and $x + y - 1 < 0$, then $o^{(1)} = [0, 0]^\top$ and
        $o^{(2)} = \sigma\left(W^{(1)} o^{(1)} + b^{(1)}\right) = \sigma\left([0, 1]^\top\right) = [0, 1]^\top$.

    (ii) If $2x - y + 1 > 0$ and $x + y - 1 < 0$, then $o^{(1)} = [2x - y + 1, 0]^\top$. Hence,
        $o^{(2)} = \sigma\left(W^{(1)} o^{(1)} + b^{(1)}\right) = \sigma\left([2x - y + 1, 1]^\top\right) = [2x - y + 1, 1]^\top$.

    (iii) If $2x - y + 1 < 0$ and $x + y - 1 > 0$, then $o^{(1)} = [0, x + y - 1]^\top$ and
        $o^{(2)} = \sigma\left(W^{(1)} o^{(1)} + b^{(1)}\right) = \sigma\left([2x + 2y - 2, x + y]^\top\right) = [2x + 2y - 2, x + y]^\top$.

    (iv) If $2x - y + 1 > 0$ and $x + y - 1 > 0$, then $o^{(1)} = [2x - y + 1, x + y - 1]^\top$
        and $o^{(2)} = \sigma\left(W^{(1)} o^{(1)} + b^{(1)}\right) = \sigma\left([4x + y - 1, x + y]^\top\right) = [4x + y - 1, x + y]^\top$.

    The resulting input space partition is shown in Fig. 10.13, where
    the corresponding linear mapping and its rank are illustrated. There is no
    hyperplane formed by the second layer. This shows how the weight and bias
    change the complexity of the input partition.

Fig. 10.13 Input space partitioning for the problem (c) case
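
The piecewise expressions in Solution 10.1 can be cross-checked numerically. The sketch below is a sanity check rather than part of the solution: the weights are the ones implied by the first- and second-layer expressions in parts (a) and (c), the function names are hypothetical, and `max(., 0)` inside each branch is used as a compact way of encoding the sub-cases listed for that branch.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

# Weights implied by the first- and second-layer expressions in Solution 10.1.
W0 = np.array([[2.0, -1.0], [1.0, 1.0]]);   b0 = np.array([1.0, -1.0])
W1_a = np.array([[1.0, 2.0], [-1.0, 1.0]]); b1_a = np.array([-9.0, -2.0])
W1_c = np.array([[1.0, 2.0], [0.0, 1.0]]);  b1_c = np.array([0.0, 1.0])

def net(p, W1, b1):
    """Direct evaluation of the two-layer ReLU network."""
    return relu(W1 @ relu(W0 @ p + b0) + b1)

def closed_form_a(x, y):
    """Case-by-case mapping of part (a)."""
    u, v = 2 * x - y + 1, x + y - 1
    if u <= 0 and v <= 0:
        return np.array([0.0, 0.0])
    if u > 0 and v <= 0:
        return np.array([max(2 * x - y - 8, 0.0), 0.0])
    if u <= 0 and v > 0:
        return np.array([max(2 * x + 2 * y - 11, 0.0), max(x + y - 3, 0.0)])
    return np.array([max(4 * x + y - 10, 0.0), max(-x + 2 * y - 4, 0.0)])

def closed_form_c(x, y):
    """Case-by-case mapping of part (c); no second-layer ReLU ever switches off."""
    u, v = 2 * x - y + 1, x + y - 1
    if u <= 0 and v <= 0:
        return np.array([0.0, 1.0])
    if u > 0 and v <= 0:
        return np.array([2 * x - y + 1, 1.0])
    if u <= 0 and v > 0:
        return np.array([2 * x + 2 * y - 2, x + y])
    return np.array([4 * x + y - 1, x + y])

def max_mismatch(n_points=2000, seed=0):
    """Largest deviation between the network and the closed forms on random inputs."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for x, y in rng.uniform(-10.0, 10.0, size=(n_points, 2)):
        p = np.array([x, y])
        worst = max(worst, np.max(np.abs(net(p, W1_a, b1_a) - closed_form_a(x, y))))
        worst = max(worst, np.max(np.abs(net(p, W1_c, b1_c) - closed_form_c(x, y))))
    return worst

print("max mismatch:", max_mismatch())
```

A mismatch at the level of floating-point rounding indicates that the tabulated cases agree with the network on the sampled inputs; it does not replace the derivation.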


10.4.5 Geometric Understanding of Autoencoder 


We are now interested in providing a more in-depth discussion of the geometry of deep
neural networks for regression problems, in particular, the autoencoder. Autoencoders
have the same input and output domains, and are commonly used for low-level
computer vision problems, such as image denoising [90, 98], super-resolution
[99, 100], and so on. Although we provide a discussion on the autoencoder, a similar
geometric understanding can be applied to other regression problems, where the
input and output domains are different. Later we will show that the geometric
understanding of the autoencoder also gives a clear insight into the geometry of
classifiers.

Based on the discussion so far, we now understand that the deep neural network 
with ReLU nonlinearities partitions the input data space into piecewise linear 
regions. In fact, this view is directly related to the manifold structure of the data, 
and we believe that the main fundamental principle to explain the success of deep 
learning is its efficient use of the manifold structure in the data. 

First, we provide a definition from differential geometry.

Fig. 10.14 Manifold geometry of autoencoder [114]


Definition 10.1 An $n$-dimensional manifold $\Sigma$ is a topological space covered by a
set of open sets, $\Sigma \subset \bigcup_\alpha U_\alpha$. For each open set $U_\alpha$, there exists a homeomorphism
$\varphi_\alpha : U_\alpha \to \mathbb{R}^n$, and the pair $(U_\alpha, \varphi_\alpha)$ forms a chart. The union of the charts forms an
atlas $\mathcal{A} = \{(U_\alpha, \varphi_\alpha)\}$.


As shown in Fig. 10.14, suppose $\mathcal{X}$ is the ambient space and $\mu$ is a probability
distribution defined on $\mathcal{X}$. The support of $\mu$,

$$\Sigma(\mu) := \{x \in \mathcal{X} : \mu(x) > 0\}, \tag{10.53}$$

is a low-dimensional manifold in $\mathcal{X}$. For a given local chart $(U_\alpha, \varphi_\alpha)$, $\varphi_\alpha : U_\alpha \to \mathcal{F}$
is called an encoder, where $\mathcal{F}$ is called the latent space or feature space. A point
$x \in \Sigma$ is called a sample; its image $\varphi_\alpha(x)$ is the corresponding feature of $x$. The
inverse map $\psi_\alpha := \varphi_\alpha^{-1} : \mathcal{F} \to \Sigma$ is called the decoder [114].

Then, an autoencoder consists of two parts, the encoder and the decoder. The
encoder takes a sample $x \in \mathcal{X}$ and maps it to the feature $z \in \mathcal{F}$, $z = \varphi(x)$.
The encoder $\varphi : \mathcal{X} \to \mathcal{F}$ maps $\Sigma$ to its latent representation $D = \varphi(\Sigma)$
homeomorphically. After that, the decoder $\psi : \mathcal{F} \to \mathcal{X}$ maps $z$ to the reconstruction
$\tilde{x}$ of the same shape as $x$,

$$\tilde{x} = \psi(z) = \psi \circ \varphi(x).$$


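This relation can be seen in the following commutative diagram [114]:

$$\begin{array}{ccc}
\{(\mathcal{X}, x), \mu, \Sigma\} & \xrightarrow{\ \varphi\ } & \{(\mathcal{F}, z), D\} \\
 & {\scriptstyle \psi \circ \varphi} \searrow & \downarrow {\scriptstyle \psi} \\
 & & \{(\mathcal{X}, \tilde{x}), \tilde{\mu}, \tilde{\Sigma}\}
\end{array}$$

In practice, both the encoder and the decoder are parameterized with the parameter $\Theta$,
so that the autoencoder is described by

$$\tilde{x} = \mathcal{T}_\Theta(x) = \psi_\Theta \circ \varphi_\Theta(x)$$

and the parameter estimation problem can be solved by

$$\min_\Theta \sum_{i=1}^N \ell\left(y_i, \mathcal{T}_\Theta(x_i)\right) + \lambda R(\Theta), \tag{10.54}$$

which is the same as the CNN training.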

Figure 10.14 shows an example geometry of each step of the autoencoder with
ReLU nonlinearities. Here, the ambient space $\mathcal{X}$ is $\mathbb{R}^3$ and the feature space is
two-dimensional, i.e., $\mathcal{F} \subset \mathbb{R}^2$. The sample $x$ is a 3-D point, so that the input manifold
$\mathcal{M} := \Sigma(\mu) \subset \mathcal{X}$ is a two-dimensional surface within $\mathbb{R}^3$, which is low-dimensional
(see Fig. 10.14). The input samples are mapped to the feature space manifold in
Fig. 10.14 using the parameterized encoder $\varphi_\Theta$. Then, this feature manifold is
mapped back to the original ambient space using the parameterized decoder $\psi_\Theta$,
as in Fig. 10.14. Due to the ReLU nonlinearities, the input manifold $\mathcal{M}$ is partitioned
into piecewise linear regions $\mathcal{D}(\varphi_\Theta)$.

The specific operation on each piecewise linear region is then defined during 
the training phase. For example, in Fig. 10.15, the input manifold is a noisy point 
cloud, whereas the label data at the output are the noiseless 3D surfaces. During 
the training, the specific operation of the neural network is guided as a low-rank 
mapping on the reconstruction manifold, as discussed in Problem 10.1. Therefore, 
the noisy outliers from the input manifold are projected into the reconstruction 
manifold via a trained neural network, which is piecewise linear at each cell but 
globally nonlinear [114]. 


10.4.6 Geometric Understanding of Classifier 


The geometric understanding of the autoencoder now gives a clear picture of what
happens in the deep neural network classifier. In this case, we only have an encoder
to map to the latent space, which leads to a simplified commutative diagram:

$$\{(\mathcal{X}, x), \mu, \Sigma\} \xrightarrow{\ \varphi\ } \{(\mathcal{F}, z), D\}.$$

Since the encoder is also parameterized with $\Theta$ and equipped with ReLUs, the input
manifold is also partitioned into piecewise linear regions, as shown in Fig. 10.14d.
Then, the linear layer followed by the softmax assigns a class probability to each
piecewise linear cell.

Fig. 10.15 Denoising as a piecewise linear projection on the reconstruction manifold [114]


10.5 Open Problems 


Our discussion so far reveals that the deep neural network is indeed trained to 
partition the input data manifold such that the linear mapping at each piecewise 
linear region can effectively perform machine learning tasks, such as classification, 
regression, etc. Therefore, we strongly believe that the clue to unveil the mystery 
of deep neural networks comes from the understanding of the high-dimensional 
manifold structure and its piecewise linear partition, and how the partitions can be 
controlled. 

In fact, many machine learning theoreticians have been focusing on this, thereby
generating many intriguing theoretical and empirical observations [115-118]. For
example, although we mentioned that the number of linear regions can potentially
increase exponentially with the network complexity, they observed that the actual
number of piecewise linear representations for specific tasks is much smaller. For
example, Fig. 10.16 shows that the number of linear regions indeed converges to
a smaller value compared to the initialization as the number of epochs increases
[115, 116].

Fig. 10.16 Here the authors [115, 116] show the linear regions that intersect a 2D plane through
input space for a network of depth 3 and width 64 trained on MNIST. Panel titles: Epoch 0: 9744
regions; Epoch 1: 4196 regions; Epoch 20: 8541 regions


Fig. 10.17 Linear regions and classification regions of models trained with different optimization
techniques [117]. Columns: batch-norm, dropout, vanilla. Rows: classification regions (top),
linear regions (bottom)

Note that not only the number of epochs determines the number of piecewise linear
regions; the number of linear regions also varies depending on the choice of the
optimization algorithm. For example, Fig. 10.17 shows that the number
of linear regions varies depending on the optimization algorithm, which leads to
different classification boundaries. Here, the gray curves in the bottom row are
transition boundaries separating different linear regions, and the color represents
the activation rate of the corresponding linear region. In the top row, different colors
represent different classification regions, separated by the decision boundaries. The
models were trained on the vectorized MNIST data set, and this figure shows a
two-dimensional slice of the input space.

In fact, this phenomenon can be understood as a data-driven adaptation to
eliminate the unnecessary partitions for machine learning tasks. Note that the
partition boundary can collapse, resulting in a smaller number of partitions, as
discussed in Problem 10.1(c). It is believed that there is a compromise between
the approximation error and the robustness of the neural network in terms of the
number of piecewise linear regions. Many of these questions remain unanswered, and
much research effort is still needed to clearly understand the partition geometry of the
neural network.

Finally, while it was largely disregarded during our discussion, for the case
of CNNs, the choice of the hyperplanes becomes further constrained due to the
convolutional relationship. For example, to encode the data manifold in $\mathbb{R}^3$ with
the $r = 2$ convolution filter with the filter coefficients $[1, 2]$, the following three
vectors determine the normal directions of the three hyperplanes:

$$n_1 = [1\ 2\ 0], \quad n_2 = [0\ 1\ 2], \quad n_3 = [2\ 0\ 1], \tag{10.55}$$

where we assume the circular convolution and no pooling operation (i.e., $\Phi^l = I_3$).
This implies that each channel of the convolution filter determines an orthant
of the underlying feature space, and the feature vectors are directly related to
the coordinates on the resultant orthant. Therefore, understanding piecewise linear
regions in CNNs requires a much more in-depth understanding of the
high-dimensional geometry, which may be another very exciting research topic.
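
The normal vectors in (10.55) are simply the circular shifts of the zero-padded filter. The following minimal sketch (the function name `conv_normals` and the test input are hypothetical) builds them for the filter $[1, 2]$ in $\mathbb{R}^3$ and evaluates the resulting ReLU features for one input.

```python
import numpy as np

def conv_normals(filt, n):
    """Stack the n circular shifts of the zero-padded filter as rows.

    Each row is the normal vector of one hyperplane (circular convolution,
    no pooling assumed)."""
    base = np.zeros(n)
    base[:len(filt)] = filt
    return np.stack([np.roll(base, shift) for shift in range(n)])

N = conv_normals([1.0, 2.0], 3)
print(N)                                  # rows: n_1, n_2, n_3

x = np.array([0.5, -1.0, 2.0])            # hypothetical input in R^3
features = np.maximum(N @ x, 0.0)         # ReLU features for one channel
print("activation pattern:", (N @ x > 0).astype(int))
print("features:", features)
```

All three normals are tied to the same two filter coefficients, so a dense layer with three neurons on $\mathbb{R}^3$ would have nine free entries where this convolutional layer has two.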


10.6 Exercises 


1. Prove (10.24).

2. Prove the equality (10.25).

3. Fill in the missing step in (10.26).

4. Show (10.29).

5. Our goal is to derive the input-output relation in (10.32) at the encoder.


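(a) Show that

$$\left(\Phi^l \circledast \psi^l_{i,j}\right)^\top z^{l-1}_j = \Phi^{l\top} \left(z^{l-1}_j \circledast \overline{\psi}^l_{i,j}\right). \tag{10.56}$$

(b) Using (10.56), prove (10.32).

6. Our goal is to derive the input-output relation in (10.35) at the decoder.

(a) Show that

$$\left(\tilde{\Phi}^l \circledast \tilde{\psi}^l_{i,j}\right) \tilde{z}^l_j = \left(\tilde{\Phi}^l \tilde{z}^l_j\right) \circledast \tilde{\psi}^l_{i,j}. \tag{10.57}$$

(b) Using (10.57), prove (10.35).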


7. Under the frame condition (10.39), derive the perfect reconstruction condition in
(10.42).

8. Consider a three-layer fully connected network $f_\Theta : \mathbb{R}^2 \to \mathbb{R}^2$ with ReLU
nonlinearity.

(a) Suppose the weight matrices and biases are given by


    Draw the corresponding input space partition, and compute the output
    mapping with respect to an input vector $(x, y)$ in each input partition. Please
    derive all the steps explicitly.

(b) In problem (a), suppose that the bias terms are zero. Compute the input space
    partition and the output mapping. What do you observe compared to the one
    with bias?

(c) In problem (a), the last layer weight $W^{(2)}$ and bias $b^{(2)}$ are changed due to
    fine-tuning. Please give an example of $W^{(2)}$ and $b^{(2)}$ that gives the
    smallest number of partitions.

Chapter 11
Deep Learning Optimization


11.1 Introduction 


In Chap. 6, we discussed various optimization methods for deep neural network
training. Although they come in various forms, these algorithms are basically
gradient-based local update schemes. However, the biggest obstacle recognized by the entire
community is that the loss surfaces of deep neural networks are extremely
non-convex and not even smooth. This non-convexity and non-smoothness make the
optimization difficult to analyze, and the main concern was whether popular
gradient-based approaches might fall into local minimizers.

Surprisingly, the success of modern deep learning may be due to the remarkable
effectiveness of gradient-based optimization methods despite the highly non-convex
nature of the optimization problem. Extensive research has been carried out
in recent years to provide a theoretical understanding of this phenomenon. In
particular, several recent works [119-121] have noted the importance of
over-parameterization. In fact, it was shown that when the hidden layers of a deep network
have a large number of neurons compared to the number of training samples,
gradient descent or stochastic gradient descent converges to a global minimum with zero
training error. While these results are intriguing and provide important clues for
understanding the geometry of deep learning optimization, it is still unclear why
simple local search algorithms can be successful for deep neural network training.

Indeed, deep learning optimization is a rapidly evolving area of
intense research, and there are too many different approaches to cover in a single
chapter. Rather than treating a variety of techniques in a disorganized way, this
chapter explains two different lines of research as food for thought: one is based
on the geometric structure of the loss function and the other is based on the results
of Lyapunov stability. Although the two approaches are closely related, they have
different advantages and disadvantages. By explaining these two approaches, we can
cover some of the key topics of research exploration such as the optimization landscape
[122-124], over-parameterization [119, 125-129], and the neural tangent kernel (NTK)
[130-132], which have been used extensively to analyze the convergence properties of
local deep learning search methods.


11.2 Problem Formulation


In Chap. 6, we pointed out that the basic optimization problem in neural network
training can be formulated as

$$\min_{\theta \in \mathbb{R}^n} \ell(\theta), \tag{11.1}$$

where $\theta$ refers to the network parameters and $\ell : \mathbb{R}^n \to \mathbb{R}$ is the loss function. In the
case of supervised learning with the mean square error (MSE) loss, the loss function
is defined by

$$\ell(\theta) := \frac{1}{2} \left\| y - f_\theta(x) \right\|^2, \tag{11.2}$$

where $x, y$ denotes the pair of the network input and the label, and $f_\theta(\cdot)$ is a neural
network parameterized by the trainable parameters $\theta$. For the case of an $L$-layer
feed-forward neural network, the regression function $f_\theta(x)$ can be represented by

$$f_\theta(x) := \left(\sigma \circ g^{(L)} \circ \sigma \circ g^{(L-1)} \circ \cdots \circ g^{(1)}\right)(x), \tag{11.3}$$

where $\sigma(\cdot)$ denotes the element-wise nonlinearity and

$$g^{(l)} = W^{(l)} o^{(l-1)} + b^{(l)}, \tag{11.4}$$
$$o^{(l)} = \sigma\left(g^{(l)}\right), \tag{11.5}$$
$$o^{(0)} = x, \tag{11.6}$$

for $l = 1, \cdots, L$. Here, the number of the $l$-th layer hidden neurons, often referred
to as the width, is denoted by $d^{(l)}$, so that $g^{(l)}, o^{(l)} \in \mathbb{R}^{d^{(l)}}$ and $W^{(l)} \in \mathbb{R}^{d^{(l)} \times d^{(l-1)}}$.
The popular local search approaches based on gradient descent use the following
update rule:

$$\theta[k+1] = \theta[k] - \eta_k \left.\frac{\partial \ell(\theta)}{\partial \theta}\right|_{\theta = \theta[k]}, \tag{11.7}$$

where $\eta_k$ denotes the $k$-th iteration step size. In a differential equation form, the
update rule can be represented by

$$\dot{\theta}[t] = -\frac{\partial \ell(\theta[t])}{\partial \theta}, \tag{11.8}$$

where $\dot{\theta}[t] = d\theta[t]/dt$.

As previously explained, the optimization problem (11.1) is strongly non-convex, 
and it is known that the gradient-based local search schemes using (11.7) and (11.8) 
may get stuck in the local minima. Interestingly, many deep learning optimization 
algorithms appear to avoid the local minima and even result in zero training errors, 
indicating that the algorithms are reaching the global minima. In the following, we 
present two different approaches to explain this fascinating behavior of gradient 
descent approaches. 


11.3 Polyak-Łojasiewicz-Type Convergence Analysis


The loss function $\ell$ is said to be strongly convex (SC) if

$$\ell(\theta') \geq \ell(\theta) + \langle \nabla \ell(\theta), \theta' - \theta \rangle + \frac{\mu}{2}\|\theta' - \theta\|^2, \quad \forall \theta, \theta'. \tag{11.9}$$

It is known that if $\ell$ is SC, then gradient descent achieves a global linear convergence
rate for this problem [133]. Note that SC in (11.9) is a stronger condition than the
convexity in Proposition 1.1, which is given as

$$\ell(\theta') \geq \ell(\theta) + \langle \nabla \ell(\theta), \theta' - \theta \rangle, \quad \forall \theta, \theta'. \tag{11.10}$$


Our starting point is the observation that the convex analysis mentioned above
is not the right approach to analyzing a deep neural network. The non-convexity
is essential for the analysis. This situation has motivated a variety of alternatives
to the convexity to prove the convergence. One of the oldest of these conditions
is the error bounds (EB) of Luo and Tseng [134], but other conditions have been
recently considered, which include essential strong convexity (ESC) [135], weak
strong convexity (WSC) [136], and the restricted secant inequality (RSI) [137].
See their specific forms of conditions in Table 11.1. On the other hand, there
is a much older condition called the Polyak-Łojasiewicz (PL) condition, which
was originally introduced by Polyak [138] and found to be a special case of the
inequality of Łojasiewicz [139]. Specifically, we will say that a function satisfies
the PL inequality if the following holds for some $\mu > 0$:

$$\frac{1}{2}\|\nabla \ell(\theta)\|^2 \geq \mu\left(\ell(\theta) - \ell^*\right), \quad \forall \theta, \tag{11.11}$$

where $\ell^*$ denotes the optimal value of the loss. Note that this inequality implies that
every stationary point is a global minimum.
But unlike SC, it does not imply that there is a unique solution. We will revisit this
issue later.

Similar to other conditions in Table 11.1, PL is a sufficient condition for 
gradient descent to achieve a linear convergence rate [122]. In fact, PL is the 
mildest condition among them. Specifically, the following relationship between the 
conditions holds [122]: 


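$$(\text{SC}) \rightarrow (\text{ESC}) \rightarrow (\text{WSC}) \rightarrow (\text{RSI}) \rightarrow (\text{EB}) \equiv (\text{PL}),$$

if $\ell$ has a Lipschitz continuous gradient, i.e. there exists $L > 0$ such that

$$\|\nabla \ell(\theta) - \nabla \ell(\theta')\| \leq L\|\theta - \theta'\|, \quad \forall \theta, \theta'. \tag{11.12}$$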


In the following, we provide a convergence proof of the gradient descent method 
using the PL condition, which turns out to be an important tool for non-convex deep 
learning optimization problems. 


Theorem 11.1 (Karimi et al. [122]) Consider problem (11.1), where $\ell$ has an
$L$-Lipschitz continuous gradient, a non-empty solution set, and satisfies the PL
inequality (11.11). Then the gradient method with a step-size of $1/L$:

$$\theta[k+1] = \theta[k] - \frac{1}{L}\nabla \ell(\theta[k]) \tag{11.13}$$

has a global convergence rate

$$\ell(\theta[k]) - \ell^* \leq \left(1 - \frac{\mu}{L}\right)^k \left(\ell(\theta[0]) - \ell^*\right).$$


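Proof Using Lemma 11.1 (see next section), the $L$-Lipschitz continuous gradient of the
loss function $\ell$ implies that the function

$$g(\theta) = \frac{L}{2}\|\theta\|^2 - \ell(\theta)$$

is convex. Thus, the first-order equivalence of convexity in Proposition 1.1 leads to
the following:

$$\begin{aligned}
\frac{L}{2}\|\theta'\|^2 - \ell(\theta') &\geq \frac{L}{2}\|\theta\|^2 - \ell(\theta) + \langle \theta' - \theta, L\theta - \nabla \ell(\theta) \rangle \\
&= -\frac{L}{2}\|\theta\|^2 - \ell(\theta) + L\langle \theta', \theta \rangle - \langle \theta' - \theta, \nabla \ell(\theta) \rangle.
\end{aligned}$$

By arranging terms, we have

$$\ell(\theta') \leq \ell(\theta) + \langle \nabla \ell(\theta), \theta' - \theta \rangle + \frac{L}{2}\|\theta' - \theta\|^2, \quad \forall \theta, \theta'.$$

By setting $\theta' = \theta[k+1]$ and $\theta = \theta[k]$ and using the update rule (11.13), we have

$$\ell(\theta[k+1]) - \ell(\theta[k]) \leq -\frac{1}{2L}\|\nabla \ell(\theta[k])\|^2. \tag{11.14}$$

Using the PL inequality (11.11), we get

$$\ell(\theta[k+1]) - \ell(\theta[k]) \leq -\frac{\mu}{L}\left(\ell(\theta[k]) - \ell^*\right).$$

Rearranging and subtracting $\ell^*$ from both sides gives us

$$\ell(\theta[k+1]) - \ell^* \leq \left(1 - \frac{\mu}{L}\right)\left(\ell(\theta[k]) - \ell^*\right).$$

Applying this inequality recursively gives the result. $\square$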


The beauty of this proof is that we can replace the long and complicated proofs from 
other conditions with simpler proofs based on the PL inequality [122]. 
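
The rate in Theorem 11.1 can be checked numerically on a small non-convex problem. The sketch below is only an illustration under stated assumptions: the test function $\theta^2 + 3\sin^2\theta$, the constants `L = 8` and `MU = 1/32`, and all function names are choices made here, not part of the text. The PL constant is verified on a grid rather than proven, and the gradient iteration (11.13) is then compared against the bound $(1 - \mu/L)^k$.

```python
import math

def loss(theta):
    # Non-convex test function (an assumption for illustration).
    return theta ** 2 + 3.0 * math.sin(theta) ** 2

def grad(theta):
    # d/dtheta [theta^2 + 3 sin^2(theta)] = 2 theta + 3 sin(2 theta)
    return 2.0 * theta + 3.0 * math.sin(2.0 * theta)

L = 8.0          # |loss''| = |2 + 6 cos(2 theta)| <= 8, so the gradient is 8-Lipschitz
MU = 1.0 / 32.0  # assumed PL constant; checked numerically on a grid below
LOSS_STAR = 0.0  # global minimum value, attained at theta = 0

def pl_holds_on_grid(lo=-10.0, hi=10.0, n=20001):
    """Check 0.5 * |grad|^2 >= MU * (loss - LOSS_STAR) on a uniform grid."""
    for i in range(n):
        t = lo + (hi - lo) * i / (n - 1)
        if 0.5 * grad(t) ** 2 < MU * (loss(t) - LOSS_STAR) - 1e-12:
            return False
    return True

def gradient_descent(theta0, num_steps):
    """Run the update (11.13) with step size 1/L and record the loss gap."""
    theta, gaps = theta0, []
    for _ in range(num_steps + 1):
        gaps.append(loss(theta) - LOSS_STAR)
        theta = theta - grad(theta) / L
    return gaps

gaps = gradient_descent(theta0=3.0, num_steps=50)
rate = 1.0 - MU / L
# Theorem 11.1: gap after k steps is at most rate^k times the initial gap.
bound_ok = all(g <= rate ** k * gaps[0] + 1e-12 for k, g in enumerate(gaps))
print("PL holds on grid:", pl_holds_on_grid())
print("Theorem 11.1 bound holds:", bound_ok)
```

In this example the iterates converge much faster than the bound suggests, which is expected: the theorem gives a worst-case rate over all functions with the same $\mu$ and $L$.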


11.3.1 Loss Landscape and Over-Parameterization 


In Theorem 11.1, we use the two conditions for the loss function: (1) $\ell$ satisfies
the PL condition and (2) the gradient of $\ell$ is Lipschitz continuous. Although these
conditions are much weaker than the convexity of the loss function, they still impose 
the geometric constraint for the loss function, which deserves further discussion. 


Lemma 11.1 If the gradient of $\ell(\theta)$ satisfies the $L$-Lipschitz condition in (11.12),
then the transformed function $g : \mathbb{R}^n \to \mathbb{R}$ defined by

$$g(\theta) = \frac{L}{2}\theta^\top\theta - \ell(\theta) \tag{11.15}$$

is convex.


Proof Using the Cauchy-Schwarz inequality, (11.12) implies

$$\langle \nabla \ell(\theta) - \nabla \ell(\theta'), \theta - \theta' \rangle \leq L\|\theta - \theta'\|^2, \quad \forall \theta, \theta'.$$

This is equivalent to the following condition:

$$\langle \theta' - \theta, \nabla g(\theta') - \nabla g(\theta) \rangle \geq 0, \quad \forall \theta, \theta', \tag{11.16}$$

where

$$g(\theta) = \frac{L}{2}\|\theta\|^2 - \ell(\theta).$$

Thus, using the monotonicity of gradient equivalence in Proposition 1.1, we can
show that $g(\theta)$ is convex. $\square$
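
As a concrete illustration of Lemma 11.1, consider the scalar function $\ell(\theta) = \theta^2 + 3\sin^2\theta$. This example is chosen here for illustration and is not taken from the text. Its second derivative is bounded in magnitude by 8, so its gradient is Lipschitz with $L = 8$, and the transformed function can be checked directly:

$$\ell''(\theta) = 2 + 6\cos(2\theta) \in [-4, 8],$$

$$g(\theta) = \frac{8}{2}\theta^2 - \ell(\theta) = 3\theta^2 - 3\sin^2\theta, \qquad g''(\theta) = 6 - 6\cos(2\theta) = 12\sin^2\theta \geq 0.$$

Hence $\ell$ itself is not convex, since $\ell''(\theta) < 0$ wherever $\cos(2\theta) < -1/3$, while $g$ is convex as the lemma predicts.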


Lemma 11.1 implies that although $\ell$ is not convex, its transformed function by
(11.15) can be convex. Figure 11.1a shows an example of such a case. Another important
geometric consideration for the loss landscape comes from the PL condition.
More specifically, the PL condition in (11.11) implies that every stationary point is
a global minimizer, although the global minimizers may not be unique, as shown in
Fig. 11.1b,c. While the PL inequality does not imply convexity of $\ell$, it does imply
the weaker condition of invexity [122]. A function is invex if it is differentiable and
there exists a vector-valued function $\eta$ such that for any $\theta$ and $\theta'$ in $\mathbb{R}^n$, the following
inequality holds:

$$\ell(\theta') \geq \ell(\theta) + \langle \nabla \ell(\theta), \eta(\theta, \theta') \rangle. \tag{11.17}$$

A convex function is a special case of invex functions since (11.17) holds when we
set $\eta(\theta, \theta') = \theta' - \theta$. It was shown that a smooth $\ell$ is invex if and only if every
stationary point of $\ell$ is a global minimum [140]. As the PL condition implies that
every stationary point is a global minimizer, a function satisfying PL is an invex
function. The inclusion relationship between convex, invex, and PL functions is
illustrated in Fig. 11.2.

The loss landscape, where every stationary point is a global minimizer, implies
that there are no spurious local minimizers. This is often called the benign
optimization landscape. Finding the conditions for a benign optimization landscape
of neural networks was an important theoretical interest of the theorists in machine
learning. Following the original observation by Kawaguchi [141], Lu and Kawaguchi [142] and
Zhou and Liang [143] proved that the loss surfaces of linear neural networks,
whose activation functions are all linear functions, do not have any spurious local
minima under some conditions and all local minima are equally good.

Fig. 11.1 Loss landscape for the function $\ell(x)$ where (a) the transformed function in (11.15) is convex, and (b, c) the PL condition holds

Fig. 11.2 Inclusion relationship between invex, convex and PL-type functions

Unfortunately, this good property no longer holds when the activations are
nonlinear. Zhou and Liang [143] show that ReLU neural networks with one hidden 
layer have spurious local minima. Yun et al. [144] prove that ReLU neural networks 
with one hidden layer have infinitely many spurious local minima when the outputs 
are one-dimensional. 

These somewhat negative results were surprising and seemed to contradict the 
empirical success of optimization in neural networks. Indeed, it was later shown 
that if the activation functions are continuous, and the loss functions are convex 
and differentiable, over-parameterized fully-connected deep neural networks do not 
have any spurious local minima [145]. 

The reason for the benign optimization landscape for an over-parameterized 
neural network was analyzed by examining the geometry of the global minimum. 
Nguyen [123] discovered that the global minima are interconnected and concen- 
trated on a unique valley if the neural networks are sufficiently over-parameterized. 
Similar results were obtained by Liu et al. [124]. In fact, they found that the 
set of solutions of an over-parameterized system is generically a manifold of 
positive dimensions, with the Hessian matrices of the loss function being positive 
semidefinite but not positive definite. Such a landscape is incompatible with 
convexity unless the set of solutions is a linear manifold. However, the linear 
manifold with zero curvature of the curve of global minima is unlikely to occur 
due to the essential non-convexity of the underlying optimization problem. Hence, 
gradient type algorithms can converge to any of the global minima, although the
exact point of the convergence depends on a specific optimization algorithm. This 
implicit bias of an optimization algorithm is another important theoretical topic 
in deep learning, which will be covered in a later chapter. In contrast, an under- 
parameterized landscape generally has several isolated local minima with a positive 
definite Hessian of the loss, the function being locally convex. This is illustrated in 
Fig. 11.3.

Fig. 11.3 Loss landscapes of (a) under-parameterized models and (b) over-parameterized models (figure labels: Local Minima, Global Minima)


11.4 Lyapunov-Type Convergence Analysis 


Now let us introduce a different type of convergence analysis with a different 
mathematical flavor. In contrast to the methods discussed above, the analysis of 
the global loss landscape is not required here. Rather, a local loss geometry along 
the solution trajectory is the key to this analysis. 

In fact, this type of convergence analysis is based on Lyapunov stability analysis 
[146] for the solution dynamics described by (11.8). Specifically, for a given 
nonlinear system, 


$$\dot{\theta}[t] = g(\theta[t]), \tag{11.18}$$

the Lyapunov stability analysis is concerned with checking whether the solution
trajectory $\theta[t]$ converges to zero as $t \to \infty$. To provide a general solution for this,
we first define the Lyapunov function $V(z)$, which satisfies the following properties:


Definition 11.1 A function $V : \mathbb{R}^n \to \mathbb{R}$ is positive definite (PD) if

• $V(z) \geq 0$ for all $z$.
• $V(z) = 0$ if and only if $z = 0$.
• All sublevel sets of $V$ are bounded.


The Lyapunov function $V$ has an analogy to the potential function of classical
dynamics, and $-\dot{V}$ can be considered the associated generalized dissipation function.
Furthermore, if we set $z := \theta[t]$ to analyze the nonlinear dynamic system in
(11.18), then $\dot{V} : z \in \mathbb{R}^n \mapsto \mathbb{R}$ is computed by

$$\dot{V}(z) = \left(\frac{\partial V}{\partial z}\right)^\top \dot{z} = \left(\frac{\partial V}{\partial z}\right)^\top g(z). \tag{11.19}$$

The following Lyapunov global asymptotic stability theorem is one of the keys
to the stability analysis of dynamic systems:


Theorem 11.2 (Lyapunov Global Asymptotic Stability [146]) Suppose there is
a function $V$ such that 1) $V$ is positive definite, and 2) $\dot{V}(z) < 0$ for all $z \neq 0$ and
$\dot{V}(0) = 0$. Then, every trajectory $\theta[t]$ of $\dot{\theta} = g(\theta)$ converges to zero as $t \to \infty$
(i.e., the system is globally asymptotically stable).


Example: 1-D Differential Equation
Consider the following ordinary differential equation:

$$\dot{\theta} = -\theta.$$

We can easily show that the system is globally asymptotically stable since the
solution is $\theta[t] = C\exp(-t)$ for some constant $C$, and $\theta[t] \to 0$ as $t \to \infty$.
Now, we want to prove this using Theorem 11.2 without ever solving the
differential equation. First, choose a Lyapunov function

$$V(z) = \frac{z^2}{2},$$

where $z = \theta[t]$. We can easily show that $V(z)$ is positive definite. Furthermore,
we have

$$\dot{V} = z\dot{z} = -(\theta[t])^2 < 0, \quad \forall \theta[t] \neq 0.$$

Therefore, using Theorem 11.2 we can show that $\theta[t]$ converges to zero as
$t \to \infty$.
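
The example above can also be explored numerically. The following minimal sketch integrates $\dot{\theta} = -\theta$ with a forward-Euler scheme and records the Lyapunov function $V = \theta^2/2$ along the trajectory. The step size, the initial value and the function name `simulate` are illustrative assumptions, and forward Euler is only one of many possible discretizations.

```python
import math

def simulate(theta0, dt=0.01, num_steps=1000):
    """Forward-Euler integration of d(theta)/dt = -theta, recording V = theta^2 / 2."""
    theta, thetas, values = theta0, [], []
    for _ in range(num_steps + 1):
        thetas.append(theta)
        values.append(0.5 * theta ** 2)
        theta = theta + dt * (-theta)
    return thetas, values

thetas, values = simulate(theta0=2.0)

# V should decrease strictly along the trajectory, as Theorem 11.2 requires.
monotone = all(v_next < v for v, v_next in zip(values, values[1:]))

# Closed-form solution theta[t] = C exp(-t) with C = 2, evaluated at t = 10.
exact = 2.0 * math.exp(-10.0)
print("V strictly decreasing:", monotone)
print("Euler:", thetas[-1], "exact:", exact)
```

The monotone decrease of $V$ is what the Lyapunov argument uses; the closed-form solution is printed only for comparison.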


One of the beauties of Lyapunov stability analysis is that we do not need an 
explicit knowledge of the loss landscape to prove convergence. Instead, we just need 
to know the local dynamics along the solution path. To understand this claim, here 
we apply Lyapunov analysis to the convergence analysis of our gradient descent 
dynamics: 


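$$\dot{\theta}[t] = -\frac{\partial \ell}{\partial \theta}(\theta[t]).$$

For the MSE loss, this leads to

$$\dot{\theta}[t] = \frac{\partial f_{\theta[t]}(x)}{\partial \theta}\left(y - f_{\theta[t]}(x)\right). \tag{11.20}$$

Now let

$$e[t] := f_{\theta[t]}(x) - y,$$

and consider the following positive definite Lyapunov function

$$V(z) = \frac{1}{2}z^\top z,$$

where $z = e[t]$. Then, we have

$$\dot{V}(z) = \left(\frac{\partial V}{\partial z}\right)^\top \dot{z} = z^\top \dot{z}. \tag{11.21}$$

Using the chain rule, we have

$$\dot{z} = \dot{e}[t] = \left(\frac{\partial f_{\theta[t]}(x)}{\partial \theta}\right)^\top \dot{\theta}[t] = -K_t e[t],$$

where

$$K_t = K_{\theta[t]} := \left.\left(\frac{\partial f_\theta}{\partial \theta}\right)^\top \frac{\partial f_\theta}{\partial \theta}\right|_{\theta = \theta[t]} \tag{11.22}$$

is often called the neural tangent kernel (NTK) [130-132]. By plugging this into
(11.21), we have

$$\dot{V} = -e[t]^\top K_t e[t]. \tag{11.23}$$

Accordingly, if the NTK is positive definite for all $t$, then $\dot{V}(z) < 0$. Therefore,
$e[t] \to 0$ so that $f(\theta[t]) \to y$ as $t \to \infty$. This proves the convergence of the gradient
descent approach.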
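
The relation (11.23) can be checked numerically on a toy model. The sketch below makes several assumptions that are not in the text: a one-hidden-layer tanh network with scalar input and output, three training samples stacked into one error vector (so the kernel is a $3 \times 3$ Gram matrix of parameter gradients), hand-written gradients, and a forward-Euler discretization of the gradient flow. All names (`predict`, `grad_row`, `ntk`, `euler_step`) are hypothetical helpers.

```python
import math
import random

H = 4  # hidden width (illustrative choice)

def predict(theta, x):
    """f_theta(x) = sum_j v_j * tanh(w_j * x + b_j) for a scalar input x."""
    w, b, v = theta[:H], theta[H:2 * H], theta[2 * H:]
    return sum(v[j] * math.tanh(w[j] * x + b[j]) for j in range(H))

def grad_row(theta, x):
    """Gradient of f_theta(x) with respect to theta, ordered as (w, b, v)."""
    w, b, v = theta[:H], theta[H:2 * H], theta[2 * H:]
    t = [math.tanh(w[j] * x + b[j]) for j in range(H)]
    dw = [v[j] * (1.0 - t[j] ** 2) * x for j in range(H)]
    db = [v[j] * (1.0 - t[j] ** 2) for j in range(H)]
    return dw + db + t

def ntk(theta, xs):
    """Empirical NTK: K[m][n] = <grad f(x_m), grad f(x_n)>."""
    rows = [grad_row(theta, x) for x in xs]
    return [[sum(a * c for a, c in zip(rm, rn)) for rn in rows] for rm in rows]

def lyapunov(theta, xs, ys):
    """V = 0.5 * ||e||^2 with e = f_theta(x) - y stacked over the samples."""
    e = [predict(theta, x) - y for x, y in zip(xs, ys)]
    return 0.5 * sum(ei ** 2 for ei in e), e

def euler_step(theta, xs, ys, dt):
    """One forward-Euler step of the gradient flow theta_dot = -J^T e."""
    _, e = lyapunov(theta, xs, ys)
    rows = [grad_row(theta, x) for x in xs]
    g = [sum(e[m] * rows[m][i] for m in range(len(xs))) for i in range(len(theta))]
    return [p - dt * gi for p, gi in zip(theta, g)]

random.seed(0)
xs, ys = [-1.0, 0.0, 1.0], [0.5, -0.2, 0.3]
theta = [random.gauss(0.0, 1.0) for _ in range(3 * H)]

# Compare the predicted rate -e^T K e with a finite-difference estimate of dV/dt.
V0, e0 = lyapunov(theta, xs, ys)
K0 = ntk(theta, xs)
predicted_rate = -sum(e0[m] * K0[m][n] * e0[n] for m in range(3) for n in range(3))
dt = 1e-4
V1, _ = lyapunov(euler_step(theta, xs, ys, dt), xs, ys)
observed_rate = (V1 - V0) / dt

# Follow the discretized flow for a while and record the final Lyapunov value.
for _ in range(2000):
    theta = euler_step(theta, xs, ys, 0.005)
V_final, _ = lyapunov(theta, xs, ys)
print(predicted_rate, observed_rate, V0, V_final)
```

Because the kernel here is a Gram matrix, the quadratic form $e^\top K e$ is never negative; whether it is strictly positive along the whole trajectory is exactly the positive-definiteness question discussed next.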


11.4.1 The Neural Tangent Kernel (NTK) 


In the previous discussion we showed that the Lyapunov analysis only requires 
a positive-definiteness of the NTK along the solution trajectory. While this is a 
great advantage over PL-type analysis, which requires knowledge of the global loss 
landscape, the NTK is a function of time, so it is important to obtain the conditions 
for the positive-definiteness of NTK along the solution trajectory. 

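To this end, we derive the explicit form of the
NTK to understand the convergence behavior of the gradient descent methods.
Using the backpropagation in Chap. 6, we can obtain the weight update as follows:

$$\begin{aligned}
\frac{\partial f_\theta}{\partial \mathrm{VEC}(W^{(l)})} &= \frac{\partial g^{(l)}}{\partial \mathrm{VEC}(W^{(l)})}\frac{\partial o^{(l)}}{\partial g^{(l)}}\frac{\partial g^{(l+1)}}{\partial o^{(l)}}\cdots\frac{\partial o^{(L)}}{\partial g^{(L)}} \\
&= \left(o^{(l-1)} \otimes I_{d^{(l)}}\right)\Lambda^{(l)} W^{(l+1)\top}\Lambda^{(l+1)} W^{(l+2)\top}\cdots W^{(L)\top}\Lambda^{(L)}.
\end{aligned}$$

Similarly, we have

$$\begin{aligned}
\frac{\partial f_\theta}{\partial b^{(l)}} &= \frac{\partial g^{(l)}}{\partial b^{(l)}}\frac{\partial o^{(l)}}{\partial g^{(l)}}\frac{\partial g^{(l+1)}}{\partial o^{(l)}}\cdots\frac{\partial o^{(L)}}{\partial g^{(L)}} \\
&= \Lambda^{(l)} W^{(l+1)\top}\Lambda^{(l+1)} W^{(l+2)\top}\cdots W^{(L)\top}\Lambda^{(L)}.
\end{aligned}$$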


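Therefore, the NTK can be computed by

$$\begin{aligned}
K_t^{(L)} &:= \left.\left(\frac{\partial f_\theta}{\partial \theta}\right)^\top \frac{\partial f_\theta}{\partial \theta}\right|_{\theta=\theta[t]} \\
&= \sum_{l=1}^{L}\left[\left(\frac{\partial f_\theta}{\partial \mathrm{VEC}(W^{(l)})}\right)^\top \frac{\partial f_\theta}{\partial \mathrm{VEC}(W^{(l)})} + \left(\frac{\partial f_\theta}{\partial b^{(l)}}\right)^\top \frac{\partial f_\theta}{\partial b^{(l)}}\right] \\
&= \sum_{l=1}^{L}\left(\|o^{(l-1)}[t]\|^2 + 1\right) M^{(l)}[t],
\end{aligned}$$

where

$$M^{(l)}[t] = \Lambda^{(L)} W^{(L)}[t] \cdots W^{(l+1)}[t]\Lambda^{(l)}\Lambda^{(l)} W^{(l+1)\top}[t] \cdots W^{(L)\top}[t]\Lambda^{(L)}. \tag{11.24}$$

Therefore, the positive definiteness of the NTK comes from the properties of
$M^{(l)}[t]$. In particular, if $M^{(l)}[t]$ is positive definite for any $l$, the resulting NTK
is positive definite. Moreover, the positive-definiteness of $M^{(l)}[t]$ can be readily
shown if the following sensitivity matrix has full row rank:

$$S^{(l)} = \Lambda^{(L)} W^{(L)}[t] \cdots W^{(l+1)}[t]\Lambda^{(l)}.$$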
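
As a sanity check of (11.24), consider the simplest case $L = 1$. This special case is worked out here for illustration and is not taken from the text. It assumes, following the backpropagation notation, that $\Lambda^{(l)}$ is the diagonal matrix holding the activation derivatives $\sigma'(g^{(l)})$. For $f_\theta(x) = \sigma(W^{(1)}x + b^{(1)})$ the products of weight matrices in (11.24) are empty, so

$$M^{(1)} = \Lambda^{(1)}\Lambda^{(1)}, \qquad K^{(1)} = \left(\|x\|^2 + 1\right)\Lambda^{(1)}\Lambda^{(1)}.$$

The same result follows directly from $\partial f_i/\partial w_j = \delta_{ij}\,\sigma'(g_i)\,x$ and $\partial f_i/\partial b_j = \delta_{ij}\,\sigma'(g_i)$, which give $K^{(1)}_{ii'} = \delta_{ii'}\,\sigma'(g_i)^2\left(\|x\|^2 + 1\right)$. Under this assumption, for the ReLU the diagonal entries of $\Lambda^{(1)}$ are 0 or 1, so this single-layer kernel is positive definite only when every output neuron is active at $x$.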


11.4.2 NTK at Infinite Width Limit


Although we derived the explicit form of the NTK using backpropagation, the
component matrix in (11.24) is still difficult to analyze due to the stochastic nature of
the weights and ReLU activation patterns.

To address this problem, the authors in [130] calculated the NTK at the infinite
width limit and showed that it satisfies the positive definiteness. Specifically, they
considered the following normalized form of the neural network update:

$$o^{(0)} = x, \tag{11.25}$$
$$g^{(l)} = \frac{1}{\sqrt{d^{(l-1)}}} W^{(l)} o^{(l-1)} + \beta b^{(l)}, \tag{11.26}$$
$$o^{(l)} = \sigma(g^{(l)}), \tag{11.27}$$

for $l = 1, \cdots, L$, and $d^{(l)}$ denotes the width of the $l$-th layer. Furthermore,
they considered what is sometimes called LeCun initialization, taking
$W^{(l)}_{ij} \sim \mathcal{N}\left(0, \frac{1}{d^{(l-1)}}\right)$ and $b^{(l)}_j \sim \mathcal{N}(0, 1)$. Then, the following asymptotic form of the NTK
can be obtained.

Theorem 11.3 (Jacot et al. [130]) For a network of depth $L$ at initialization, with a
Lipschitz nonlinearity $\sigma$, and in the limit as the layer widths $d^{(1)}, \cdots, d^{(L-1)} \to \infty$,
the neural tangent kernel $K^{(L)}$ converges in probability to a deterministic limiting
kernel:

$$K^{(L)} \to \kappa_\infty^{(L)} \otimes I_{d^{(L)}}. \tag{11.28}$$

Here, the scalar kernel $\kappa_\infty^{(L)} : \mathbb{R}^{d^{(0)}} \times \mathbb{R}^{d^{(0)}} \to \mathbb{R}$ is defined recursively by

$$\kappa_\infty^{(1)}(x, x') = \frac{1}{d^{(0)}} x^\top x' + \beta^2, \tag{11.29}$$
$$\kappa_\infty^{(l+1)}(x, x') = \kappa_\infty^{(l)}(x, x')\,\dot{\nu}^{(l+1)}(x, x') + \nu^{(l+1)}(x, x'), \tag{11.30}$$

where

$$\nu^{(l+1)}(x, x') = \mathbb{E}_g\left[\sigma(g(x))\sigma(g(x'))\right] + \beta^2, \tag{11.31}$$
$$\dot{\nu}^{(l+1)}(x, x') = \mathbb{E}_g\left[\dot{\sigma}(g(x))\dot{\sigma}(g(x'))\right], \tag{11.32}$$

where the expectation is with respect to a centered Gaussian process $g$ of covariance
$\nu^{(l)}$ and where $\dot{\sigma}$ denotes the derivative of $\sigma$.


Note that the asymptotic form of the NTK is positive definite since $\kappa_\infty^{(L)} > 0$.
Therefore, the gradient descent using the infinite width NTK converges to the global
minima. Again, we can clearly see the benefit of the over-parameterization in terms
of large network width.


11.4.3 NTK for General Loss Function


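Now, we are interested in extending the example above to the general loss function
with multiple training samples. For a given training data set $\{x_n\}_{n=1}^N$, the gradient
dynamics in (11.7) can be extended to

$$\dot{\theta} = -\sum_{n=1}^N \frac{\partial \ell(f_\theta(x_n))}{\partial \theta} = -\sum_{n=1}^N \frac{\partial f_\theta(x_n)}{\partial \theta}\frac{\partial \ell(x_n)}{\partial f_\theta(x_n)},$$

where $\ell(x_n) := \ell(f_\theta(x_n))$ with a slight abuse of notation. This leads to

$$\begin{aligned}
\dot{f}_\theta(x_m) &= \left(\frac{\partial f_\theta(x_m)}{\partial \theta}\right)^\top \dot{\theta} \\
&= -\sum_{n=1}^N \left(\frac{\partial f_\theta(x_m)}{\partial \theta}\right)^\top \frac{\partial f_\theta(x_n)}{\partial \theta}\frac{\partial \ell(x_n)}{\partial f_\theta(x_n)} \\
&= -\sum_{n=1}^N K_t(x_m, x_n)\frac{\partial \ell(x_n)}{\partial f_\theta(x_n)},
\end{aligned}$$

where $K_t(x_m, x_n)$ denotes the $(m, n)$-th block NTK defined by

$$K_t(x_m, x_n) := \left.\left(\frac{\partial f_\theta(x_m)}{\partial \theta}\right)^\top \frac{\partial f_\theta(x_n)}{\partial \theta}\right|_{\theta=\theta[t]}.$$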

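Now, consider the following Lyapunov function candidate:

$$V(z) = \sum_{m=1}^N \ell(f_\theta(x_m)) = \sum_{m=1}^N \ell(z_m + f_m^*),$$

where

$$z = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_N \end{bmatrix} = \begin{bmatrix} f_\theta(x_1) - f^*(x_1) \\ f_\theta(x_2) - f^*(x_2) \\ \vdots \\ f_\theta(x_N) - f^*(x_N) \end{bmatrix},$$

and $f_m^* := f^*(x_m)$ refers to $f_{\theta^*}(x_m)$ with $\theta^*$ being the global minimizer. We further
assume that the loss function satisfies the property that

$$\forall n, \quad \ell(f_\theta(x_n)) > 0 \ \text{ if } f_\theta(x_n) \neq f_n^*, \qquad \ell(f_n^*) = 0,$$

so that $V(z)$ is a positive definite function. Under this assumption, we have

$$\begin{aligned}
\dot{V} &= \sum_{m=1}^N \left.\left(\frac{\partial \ell(f_\theta(x_m))}{\partial f_\theta(x_m)}\right)^\top \dot{f}_\theta(x_m)\right|_{\theta=\theta[t]} \\
&= -\sum_{m=1}^N\sum_{n=1}^N \left.\left(\frac{\partial \ell(f_\theta(x_m))}{\partial f_\theta(x_m)}\right)^\top K_t(x_m, x_n)\frac{\partial \ell(f_\theta(x_n))}{\partial f_\theta(x_n)}\right|_{\theta=\theta[t]} \\
&= -e[t]^\top K[t] e[t],
\end{aligned}$$

where

$$e[t] = \left.\begin{bmatrix} \frac{\partial \ell(f_\theta(x_1))}{\partial f_\theta(x_1)} \\ \vdots \\ \frac{\partial \ell(f_\theta(x_N))}{\partial f_\theta(x_N)} \end{bmatrix}\right|_{\theta=\theta[t]}, \qquad K[t] = \begin{bmatrix} K_t(x_1, x_1) & \cdots & K_t(x_1, x_N) \\ \vdots & \ddots & \vdots \\ K_t(x_N, x_1) & \cdots & K_t(x_N, x_N) \end{bmatrix}.$$

Therefore, if the NTK $K[t]$ is positive definite for all $t$, then Lyapunov stability
theory guarantees that the gradient dynamics converge to the global minima.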


11.5 Exercises 


1. Show that a smooth $\ell(\theta)$ is invex if and only if every stationary point of $\ell(\theta)$ is a
global minimum.

2. Show that a convex function is invex. 

3. Let a > 0. Show that V(x, y) = x* + 2y? is a Lyapunov function for the system 


k =ay’—-x,y=-y—ax’. 


4. Show that $V(x, y) = \ln(1 + x^2) + y^2$ is a Lyapunov function for the system

$$\dot{x} = x(y - 1), \qquad \dot{y} = -\frac{x^2}{1 + x^2}.$$

5. Consider a two-layer fully connected network $f_\theta : \mathbb{R}^2 \to \mathbb{R}^2$ with ReLU
nonlinearity, as shown in Fig. 10.10.
(a) Suppose the weight matrices and biases are given by


2—1 1 
(0) _ (0) _ 
Ww =|, nul b =| i 
Le ~9 
wh) = dd) _ 
- E i} ain [=] 


Given the corresponding input space partition in Fig. 10.11, compute the 
neural tangent kernel for each partition. Are they positive definite? 
(b) In problem (a), suppose that the second layer weight and bias are changed to 


12 0 
a) _ Q) 
mondo el} 


Given the corresponding input space partition, compute the neural tangent 
kernel for each partition. Are they positive definite? 


Chapter 12
Generalization Capability of Deep Learning


12.1 Introduction 


One of the main reasons for the enormous success of deep neural networks is 
their amazing ability to generalize, which seems mysterious from the perspective 
of classic machine learning. In particular, the number of trainable parameters in
deep neural networks is often greater than the number of training samples, a situation
that is notorious for overfitting from the point of view of classical statistical
learning theory. However, empirical results have shown that a deep neural network 
generalizes well at the test phase, resulting in high performance for the unseen data. 

This apparent contradiction has raised questions about the mathematical foun- 
dations of machine learning and their relevance to practitioners. A number of 
theoretical papers have been published to understand the intriguing generalization 
phenomenon in deep learning models [147-153]. The simplest approach to studying 
generalization in deep learning is to prove a generalization bound, which is typically 
an upper limit for test error. A key component in these generalization bounds is the 
notion of complexity measure: a quantity that monotonically relates to some aspect 
of generalization. Unfortunately, it is difficult to find tight bounds for a deep neural 
network that can explain the fascinating ability to generalize. 

Recently, the authors in [154, 155] have delivered groundbreaking work that can 
reconcile classical understanding and modern practice in a unified framework. The 
so-called “double descent” curve extends the classical U-shaped bias-variance trade- 
off curve by showing that increasing the model capacity beyond the interpolation 
point leads to improved performance in the test phase. Particularly, the induced bias 
by optimization algorithms such as the stochastic gradient descent (SGD) offers 
simpler solutions that improve generalization in the over-parameterized regime. 
This relationship between the algorithms and structure of machine learning models 
describes the limits of classical analysis and has implications for the theory and 
practice of machine learning.

This chapter also presents new results showing that a generalization bound
based on the robustness of the algorithm can be a promising tool to understand 
the generalization ability of the ReLU network. In particular, we claim that it 
can potentially offer a tight generalization bound that depends on the piecewise 
linear nature of the deep neural network and the inductive bias of the optimization 
algorithms. 


12.2 Mathematical Preliminaries 


Let $Q$ be an arbitrary distribution over $z := (x, y)$, where $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ denote
the input and output of the learning algorithm, and $\mathcal{Z} := \mathcal{X} \times \mathcal{Y}$ refers to the sample
space. Let $\mathcal{F}$ be a hypothesis class and let $\ell(f, z)$ be a loss function. For the case of
regression with MSE loss, the loss can be defined as

$$\ell(f, z) = \frac{1}{2}\|y - f(x)\|^2.$$

Over the choice of an i.i.d. training set $\mathcal{S} := \{z_n\}_{n=1}^N$, which is sampled according
to $Q$, an algorithm $\mathcal{A}$ returns the estimated hypothesis

$$f_{\mathcal{S}} = \mathcal{A}(\mathcal{S}). \tag{12.1}$$

For example, the estimated hypothesis from the popular empirical risk minimization
(ERM) principle [10] is given by

$$f_{ERM} = \arg\min_{f \in \mathcal{F}} \hat{R}_N(f), \tag{12.2}$$

where the empirical risk $\hat{R}_N(f)$ is defined by

$$\hat{R}_N(f) := \frac{1}{N}\sum_{n=1}^N \ell(f, z_n), \tag{12.3}$$

which is assumed to uniformly converge to the population (or expected) risk defined
by

$$R(f) = \mathbb{E}_{z \sim Q}\,\ell(f, z). \tag{12.4}$$
If uniform convergence holds, then the empirical risk minimizer (ERM) is consistent,
that is, the population risk of the ERM converges to the optimal population
risk, and the problem is said to be learnable using the ERM [10].

In fact, learning algorithms that satisfy such performance guarantees are called
probably approximately correct (PAC) learning algorithms [156]. Formally, PAC learnability
is defined as follows.


Definition 12.1 (PAC Learnability [156]) A concept class $\mathcal{C}$ is PAC learnable if
there exist some algorithm $\mathcal{A}$ and a polynomial function $\mathrm{poly}(\cdot)$ such that the
following holds. Pick any target concept $c \in \mathcal{C}$. Pick any input distribution $P$
over $\mathcal{X}$. Pick any $\epsilon, \delta \in [0, 1]$. Define $\mathcal{S} := \{x_n, c(x_n)\}_{n=1}^N$, where $x_n \sim P$ are
i.i.d. samples. Given $N \geq \mathrm{poly}(1/\epsilon, 1/\delta, \dim(\mathcal{X}), \mathrm{size}(c))$, where $\dim(\mathcal{X}), \mathrm{size}(c)$
denote the computational costs of representing inputs $x \in \mathcal{X}$ and target $c$, the
generalization error is bounded as

$$P_{x \sim P}\{\mathcal{A}_{\mathcal{S}}(x) \neq c(x)\} \leq \epsilon, \tag{12.5}$$

where $\mathcal{A}_{\mathcal{S}}$ denotes the learned hypothesis by the algorithm $\mathcal{A}$ using the training
data $\mathcal{S}$.


The PAC learnability is closely related to the generalization bounds. More 
specifically, the ERM could only be considered a solution to a machine learning 
problem or PAC-learnable if the difference between the training error and the 
generalization error, called the generalization gap, is small enough. This implies 
that the following probability should be sufficiently small: 


$$P\left\{\sup_{f \in \mathcal{F}} |R(f) - \hat{R}_N(f)| > \epsilon\right\}. \tag{12.6}$$


Note that this is the worst-case probability, so even in the worst-case scenario, we 
try to minimize the difference between the empirical risk and the expected risk. 

A standard trick to bound the probability in (12.6) is based on concentration 
inequalities. For example, Hoeffding’s inequality is useful. 


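Theorem 12.1 (Hoeffding's Inequality [157]) If $x_1, x_2, \cdots, x_N$ are $N$ i.i.d. samples
of a random variable $X$ distributed by $P$, and $a \leq x_n \leq b$ for every $n$, then for
a small positive nonzero value $\epsilon$:

$$P\left\{\left|\mathbb{E}[X] - \frac{1}{N}\sum_{n=1}^N x_n\right| > \epsilon\right\} \leq 2\exp\left(\frac{-2N\epsilon^2}{(b-a)^2}\right). \tag{12.7}$$

Assuming that our loss is bounded between 0 and 1 using a 0/1 loss function
or by squashing any other loss between 0 and 1, (12.6) can be bounded as follows
using Hoeffding's inequality:

$$\begin{aligned}
P\left\{\sup_{f \in \mathcal{F}} |R(f) - \hat{R}_N(f)| > \epsilon\right\} &= P\left\{\bigcup_{f \in \mathcal{F}} \left\{|R(f) - \hat{R}_N(f)| > \epsilon\right\}\right\} \\
&\overset{(a)}{\leq} \sum_{f \in \mathcal{F}} P\left\{|R(f) - \hat{R}_N(f)| > \epsilon\right\} \\
&\leq 2|\mathcal{F}|\exp(-2N\epsilon^2),
\end{aligned} \tag{12.8}$$

where $|\mathcal{F}|$ is the size of the hypothesis space and we use the union bound in (a) to
obtain the inequality. By denoting the right hand side of the above inequality by $\delta$,
we can say that with probability at least $1 - \delta$, we have

$$R(f) \leq \hat{R}_N(f) + \sqrt{\frac{\ln|\mathcal{F}| + \ln\frac{2}{\delta}}{2N}}. \tag{12.9}$$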


Indeed, (12.9) is one of the simplest forms of the generalization bound, but still
reveals the fundamental bias-variance trade-off in classical statistical learning
theory. For example, the ERM for a given function class $\mathcal{F}$ results in the minimum
empirical loss:

$$\hat{R}_N(f_{ERM}) = \min_{f \in \mathcal{F}} \hat{R}_N(f), \tag{12.10}$$

which goes to zero as the hypothesis class $\mathcal{F}$ becomes bigger. On the other hand, the
second term in (12.9) grows with increasing $|\mathcal{F}|$. This trade-off in the generalization
bound with respect to the hypothesis class size $|\mathcal{F}|$ is illustrated in Fig. 12.1.

Although the expression in (12.9) looks very nice, it turns out that the bound is
very loose. This is due to the term $|\mathcal{F}|$ which originates from the union bound of all
elements in the hypothesis class $\mathcal{F}$. In the following, we discuss some representative
classical approaches to obtain tighter generalization bounds.
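
To get a feeling for the size of the second term in (12.9), the short sketch below evaluates $\sqrt{(\ln|\mathcal{F}| + \ln(2/\delta))/(2N)}$ for a few hypothesis class sizes, and inverts $2|\mathcal{F}|\exp(-2N\epsilon^2) \leq \delta$ to obtain a sufficient sample size. The function names and the numerical values are illustrative assumptions only.

```python
import math

def finite_class_gap(num_hypotheses, num_samples, delta):
    """Second term of (12.9): sqrt((ln|F| + ln(2/delta)) / (2N))."""
    return math.sqrt((math.log(num_hypotheses) + math.log(2.0 / delta)) / (2.0 * num_samples))

def samples_needed(num_hypotheses, epsilon, delta):
    """Smallest integer N with 2|F| exp(-2 N eps^2) <= delta, from solving for N."""
    return math.ceil((math.log(num_hypotheses) + math.log(2.0 / delta)) / (2.0 * epsilon ** 2))

# The gap grows only logarithmically with |F| and shrinks like 1/sqrt(N).
for size in (10, 10**3, 10**6):
    print(size, finite_class_gap(size, num_samples=10_000, delta=0.05))

n_required = samples_needed(10**6, epsilon=0.05, delta=0.05)
print(n_required, finite_class_gap(10**6, n_required, 0.05))
```

The logarithmic dependence on $|\mathcal{F}|$ is mild, but the bound becomes vacuous for infinite hypothesis classes, which motivates the refinements discussed next.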


12.2.1 Vapnik—Chervonenkis (VC) Bounds 


One of the key ideas of the work of Vapnik and Chervonenkis [10] is to replace
the union bound over the entire hypothesis class in (12.8) with the union bound over simpler
empirical distributions. This idea is historically important, so we will review it here.

Fig. 12.1 Generalization bound behavior according to the hypothesis class size $|\mathcal{F}|$


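More specifically, consider independent samples $z'_n := (x'_n, y'_n)$ for $n = 1, \cdots, N$,
which are often called "ghost" samples. The associated empirical risk
is given by

$$\hat{R}'_N(f) = \frac{1}{N}\sum_{n=1}^N \ell(f, z'_n). \tag{12.11}$$

Then, we have the following symmetrization lemma.


Lemma 12.1 (Symmetrization [10]) For a given sample set $\mathcal{S} := \{x_n, y_n\}_{n=1}^N$ and
its ghost samples set $\mathcal{S}' := \{x'_n, y'_n\}_{n=1}^N$ from a distribution $Q$ and for any $\epsilon > 0$
such that $\epsilon \geq \sqrt{2/N}$, we have

$$P\left\{\sup_{f \in \mathcal{F}} |R(f) - \hat{R}_N(f)| > \epsilon\right\} \leq 2P\left\{\sup_{f \in \mathcal{F}} |\hat{R}'_N(f) - \hat{R}_N(f)| > \frac{\epsilon}{2}\right\}. \tag{12.12}$$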
feF SEF 


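Vapnik and Chervonenkis [10] used the symmetrization lemma to obtain a much
tighter generalization bound:

$$\begin{aligned}
P\left\{\sup_{f \in \mathcal{F}} |R(f) - \hat{R}_N(f)| > \epsilon\right\} &\leq 2P\left\{\sup_{f \in \mathcal{F}_{\mathcal{S},\mathcal{S}'}} |\hat{R}'_N(f) - \hat{R}_N(f)| > \frac{\epsilon}{2}\right\} \\
&= 2P\left\{\bigcup_{f \in \mathcal{F}_{\mathcal{S},\mathcal{S}'}} \left\{|\hat{R}'_N(f) - \hat{R}_N(f)| > \frac{\epsilon}{2}\right\}\right\} \\
&\leq 2G_{\mathcal{F}}(2N) \cdot P\left\{|\hat{R}'_N(f) - \hat{R}_N(f)| > \frac{\epsilon}{2}\right\} \\
&\leq 2G_{\mathcal{F}}(2N)\exp(-N\epsilon^2/8),
\end{aligned}$$

where the last inequality is obtained by Hoeffding's inequality and $\mathcal{F}_{\mathcal{S},\mathcal{S}'}$ denotes
the restriction of the hypothesis class to the empirical distribution for $\mathcal{S}, \mathcal{S}'$. Here,
$G_{\mathcal{F}}(\cdot)$ is called the growth function defined by

$$G_{\mathcal{F}}(2N) := |\mathcal{F}_{\mathcal{S},\mathcal{S}'}|, \tag{12.13}$$

which represents the number of the most possible sets of dichotomies using the
hypothesis class $\mathcal{F}$ on any $2N$ points from $\mathcal{S}$ and $\mathcal{S}'$.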

The discovery of the growth function is one of the important contributions of 
Vapnik and Chervonenkis [10]. This is closely related to the concept of shattering, 
which is formally defined as follows. 


Definition 12.2 (Shattering) We say $\mathcal{F}$ shatters $\mathcal{S}$ if $|\mathcal{F}_{\mathcal{S}}| = 2^{|\mathcal{S}|}$.


In fact, the growth function $G_{\mathcal{F}}(N)$ is often called the shattering number: the
number of the most possible sets of dichotomies using the hypothesis class $\mathcal{F}$ on
any $N$ points. Below, we show several facts for the growth function:

• By definition, the shattering number satisfies $G_{\mathcal{F}}(N) \leq 2^N$.
• When $\mathcal{F}$ is finite, we always have $G_{\mathcal{F}}(N) \leq |\mathcal{F}|$.
• If $G_{\mathcal{F}}(N) = 2^N$, then there is a set of $N$ points such that the class of functions $\mathcal{F}$
  can generate any possible classification result on these points. Figure 12.2 shows
  such a case where $\mathcal{F}$ is the class of linear classifiers.
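
The facts listed above can be explored by brute force on a simple hypothesis class. The sketch below uses the class of interval indicators $x \mapsto 1[a \leq x \leq b]$ on the real line, which is an assumption made here for illustration and is not one of the classes discussed in the text. It enumerates every dichotomy that the class can realize on $N$ distinct points and compares the count with $2^N$. The helper names are hypothetical.

```python
def interval_dichotomies(points):
    """All labelings of `points` realizable by classifiers x -> 1[a <= x <= b]."""
    pts = sorted(points)
    # Candidate endpoints: one value below, between each pair, and above the points.
    cuts = [pts[0] - 1.0] + [(p + q) / 2.0 for p, q in zip(pts, pts[1:])] + [pts[-1] + 1.0]
    labelings = set()
    for a in cuts:
        for b in cuts:
            labelings.add(tuple(1 if a <= x <= b else 0 for x in points))
    return labelings

def growth(n):
    """Number of dichotomies realized on n distinct points on the line."""
    return len(interval_dichotomies([float(i) for i in range(n)]))

def shatters(n):
    """True if the interval class realizes all 2^n labelings of n points."""
    return growth(n) == 2 ** n

for n in range(1, 6):
    print(n, growth(n), 2 ** n, shatters(n))
```

For this class the count grows only quadratically in $N$, so it equals $2^N$ for very small $N$ and then falls behind, which is the behavior that the growth function is designed to capture.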


Fig. 12.2 Most possible sets of dichotomies using linear classifier on any three points. The
resulting shattering number is $G_{\mathcal{F}}(3) = 8$

Accordingly, we arrive at the following classical VC bound [10]:


Theorem 12.2 (VC Bound) For any $\delta > 0$, with probability at least $1 - \delta$, we have

$$R(f) \leq \hat{R}_N(f) + \sqrt{\frac{8\ln G_{\mathcal{F}}(2N) + 8\ln\frac{2}{\delta}}{N}}. \tag{12.14}$$


Another important contribution of the work by Vapnik and Chervonenkis [10] 
is that the growth function can be bounded by the so-called VC dimension, and 
the number of data points for which we cannot get all possible dichotomies (=VC 
dimension +1) is called the break point. 


Definition 12.3 (VC Dimension) The VC dimension of a hypothesis class $\mathcal{F}$ is the
largest $N = d_{VC}(\mathcal{F})$ such that

$$G_{\mathcal{F}}(N) = 2^N.$$

In other words, the VC dimension of a function class $\mathcal{F}$ is the cardinality of the
largest set that it can shatter.


This means that the VC dimension is a measure of the capacity (complexity,
expressiveness, richness, or flexibility) of a set of functions that can be learned from
a statistical binary classification algorithm. It is defined as the cardinality of the
largest set of points that the algorithm can classify with zero training error for
every possible labeling of those points. In
the following, we show several examples where we can explicitly calculate the VC
dimensions.


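Example: Half-Sided Interval
Consider any function of the form $\mathcal{F} = \{f(x) = \chi(x < \theta), \theta \in \mathbb{R}\}$. It
can shatter any single point, but no pair of points $x_1 < x_2$ can be shattered, since the
labeling $(0, 1)$ cannot be realized by any threshold $\theta$. Therefore,
$d_{VC}(\mathcal{F}) = 1$.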


Example: Half Plane

Consider a hypothesis class $\mathcal{F}$ composed of half planes in $\mathbb{R}^d$. It can shatter
$d + 1$ points, but any $d + 2$ points cannot be shattered. Therefore, $d_{VC}(\mathcal{F}) = d + 1$.

Example: Sinusoids

$f$ is a single-parametric sine classifier, i.e., for a certain parameter $\theta$, the
classifier $f_\theta$ returns 1 if the input number $x$ is larger than $\sin(\theta x)$ and 0
otherwise. The VC dimension of $f$ is infinite, since it can shatter any finite
subset of the set $\{2^{-m} \mid m \in \mathbb{N}\}$.


Finally, we can derive the generalization bound using the VC dimension. For this, 
the following lemma by Sauer is the key element. 


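Lemma 12.2 (Sauer's Lemma [158]) Suppose that $\mathcal{F}$ has a finite VC dimension
$d_{VC}$. Then

$$G_{\mathcal{F}}(n) \leq \sum_{k=0}^{d_{VC}} \binom{n}{k} \tag{12.15}$$

and for all $n \geq d_{VC}$,

$$G_{\mathcal{F}}(n) \leq \left(\frac{en}{d_{VC}}\right)^{d_{VC}}. \tag{12.16}$$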


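Corollary 12.1 (VC Bound Using VC Dimension) Let $N \ge d_{VC}$. Then, for any $\delta > 0$, with probability at least $1-\delta$, we have

$$R(f) \le \hat{R}_N(f) + \sqrt{\frac{8 d_{VC}\ln\frac{2eN}{d_{VC}} + 8\ln\frac{4}{\delta}}{N}}. \tag{12.17}$$

Proof This is a direct consequence of Theorem 12.2 and Lemma 12.2, since (12.16) with $n = 2N$ gives $\ln G_{\mathcal F}(2N) \le d_{VC}\ln\frac{2eN}{d_{VC}}$. □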
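To get a feeling for the numbers, the two forms of Sauer's bound and the resulting complexity penalty can be evaluated directly. This is a small sketch; the function names are hypothetical, and the penalty is assumed to have the form $\sqrt{(8 d_{VC}\ln(2eN/d_{VC}) + 8\ln(4/\delta))/N}$.

```python
import math


def sauer_sum(n, d):
    """Right-hand side of (12.15): sum_{k=0}^{d} C(n, k)."""
    return sum(math.comb(n, k) for k in range(d + 1))


def sauer_exp(n, d):
    """Right-hand side of (12.16): (e n / d)^d, valid for n >= d."""
    return (math.e * n / d) ** d


def vc_penalty(N, d, delta):
    """Assumed penalty term: sqrt((8 d ln(2eN/d) + 8 ln(4/delta)) / N)."""
    return math.sqrt(
        (8 * d * math.log(2 * math.e * N / d) + 8 * math.log(4 / delta)) / N
    )


# Growth is polynomial in n, far below the 2^n labelings of n points.
print(sauer_sum(10, 3), sauer_exp(10, 3), 2 ** 10)

# The penalty is only informative when N is much larger than d.
for N in (100, 10_000, 1_000_000):
    print(N, vc_penalty(N, d=10, delta=0.05))
```

For a loss bounded in $[0, 1]$, a penalty above 1 says nothing about the true risk, which is what happens here for small $N$.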


The VC dimension has been studied for deep neural networks to understand 
their generalization behaviors [159]. Bartlett et al. [160] prove bounds on the VC
dimension of piece-wise linear networks with potential weight sharing. Although 
this measure could be predictive when the architecture changes, which happens only 
in depth and width hyperparameter types, the authors in [159] also found that it 
is negatively correlated with the generalization gap, which contradicts the widely 
known empirical observation that over-parametrization improves generalization in 
deep learning [159]. 
12.2.2 Rademacher Complexity Bounds


Another important classical approach for the generalization error bound is Rademacher complexity [161]. To understand this concept, consider the following toy example. Let $S := \{(x_n, y_n)\}_{n=1}^N$ denote the training sample set, where $y_n \in \{-1, 1\}$. Then, the training error can be computed by

$$\mathrm{err}_N(f) = \frac{1}{N}\sum_{n=1}^N \mathbb{1}[f(x_n) \ne y_n], \tag{12.18}$$

where $\mathbb{1}[\cdot]$ is an indicator function computed by

$$\mathbb{1}[f(x_n) \ne y_n] = \begin{cases} 1, & (f(x_n), y_n) \in \{(1,-1), (-1,1)\}, \\ 0, & (f(x_n), y_n) \in \{(1,1), (-1,-1)\}. \end{cases} \tag{12.19}$$

Then, (12.18) can be equivalently represented by

$$\mathrm{err}_N(f) = \frac{1}{N}\sum_{n=1}^N \frac{1 - y_n f(x_n)}{2} = \frac{1}{2} - \frac{1}{2}\underbrace{\frac{1}{N}\sum_{n=1}^N y_n f(x_n)}_{\text{correlation}}. \tag{12.20}$$


Therefore, minimizing the training error is equivalent to maximizing the correlation. Now, the core idea of the Rademacher complexity is to consider a game where a player generates random targets $\{y_n\}_{n=1}^N$, and another player provides the hypothesis that maximizes the correlation:

$$\sup_{f \in \mathcal F} \frac{1}{N}\sum_{n=1}^N y_n f(x_n). \tag{12.21}$$

Note that the idea is closely related to the shattering in VC analysis. Specifically, if the hypothesis class $\mathcal F$ shatters $S = \{x_n\}_{n=1}^N$, then the correlation becomes a maximum. However, in contrast to the VC analysis that considers the worst-case scenario, Rademacher complexity analysis deals with average-case analysis.


Formally, we define the so-called Rademacher complexity [161]. 
Definition 12.4 (Rademacher Complexity [161]) Let $\sigma_1, \dots, \sigma_N$ be independent random variables with $P\{\sigma_n = 1\} = P\{\sigma_n = -1\} = \frac{1}{2}$. Then, the empirical Rademacher complexity of $\mathcal F$ is defined by

$$\mathrm{Rad}_N(\mathcal F, S) = \mathbb E_{\sigma}\left[\sup_{f \in \mathcal F} \frac{1}{N}\sum_{n=1}^N \sigma_n f(x_n)\right], \tag{12.22}$$

where $\sigma = [\sigma_1, \dots, \sigma_N]$. In addition, the general notion of Rademacher complexity is computed by

$$\mathrm{Rad}_N(\mathcal F) := \mathbb E_S\left[\mathrm{Rad}_N(\mathcal F, S)\right]. \tag{12.23}$$


Another important advantage of Rademacher complexity is that it can be easily generalized to the regression problem for the vector target. For example, (12.23) can be generalized as follows:

$$\mathrm{Rad}_N(\mathcal F) = \mathbb E\left[\sup_{f \in \mathcal F} \frac{1}{N}\sum_{n=1}^N \langle \sigma_n, f(x_n) \rangle\right], \tag{12.24}$$

where $\{\sigma_n\}_{n=1}^N$ refers to the independent random vectors. In the following, we provide some examples where the Rademacher complexity can be explicitly calculated.


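Example: Minimum Rademacher Complexity

When the hypothesis class has one element, i.e. $|\mathcal F| = 1$, we have

$$\mathrm{Rad}_N(\mathcal F) = \mathbb E\left[\frac{1}{N}\sum_{n=1}^N \sigma_n f(x_n)\right] = \frac{1}{N}\sum_{n=1}^N f(x_n)\,\mathbb E[\sigma_n] = 0,$$

where the second equality comes from the fact that, when $|\mathcal F| = 1$, the supremum is taken over a single fixed function, so that $f(x_n)$ does not depend on $\sigma_n$. The final equality comes from the definition of the random variable $\sigma_n$, which has zero mean.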


Example: Maximum Rademacher Complexity

When $|\mathcal F| = 2^N$, we have

$$\mathrm{Rad}_N(\mathcal F) = \mathbb E\left[\sup_{f \in \mathcal F}\frac{1}{N}\sum_{n=1}^N \sigma_n f(x_n)\right] = \mathbb E\left[\frac{1}{N}\sum_{n=1}^N \sigma_n^2\right] = 1,$$

where the second equality comes from the fact that we can find a hypothesis such that $f(x_n) = \sigma_n$ for all $n$. The final equality comes from the definition of the random variable $\sigma_n$.
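For a finite hypothesis class and a small sample, the empirical Rademacher complexity can be computed exactly by enumerating all $2^N$ sign vectors. The sketch below reproduces the two extreme cases; the function name and the representation of a hypothesis by its output tuple on $S$ are assumptions made for illustration.

```python
from itertools import product


def empirical_rademacher(outputs):
    """Exact empirical Rademacher complexity of a finite class.

    `outputs` is a list of tuples; each tuple holds the values
    (f(x_1), ..., f(x_N)) of one hypothesis f on the fixed sample S.
    """
    n = len(outputs[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=n):  # all 2^N sign vectors
        # Best achievable correlation with this particular sign vector.
        total += max(
            sum(s * v for s, v in zip(sigma, out)) / n for out in outputs
        )
    return total / 2 ** n


N = 4
single = [(1, -1, 1, 1)]                      # |F| = 1
all_signs = list(product((-1, 1), repeat=N))  # |F| = 2^N, realizes every labeling

rad_single = empirical_rademacher(single)
rad_all = empirical_rademacher(all_signs)
print(rad_single, rad_all)
```

Classes between these two extremes give values strictly between 0 and 1, which is what makes the quantity useful as a capacity measure.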


Although the Rademacher complexity was originally derived above for the 
binary classifiers, it can also be used to evaluate the complexity of the regression. 
The following example shows that a closed form Rademacher complexity can be 
obtained for ridge regression. 


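Example: Ridge Regression

Let $\mathcal F$ be the class of linear predictors given by $y = w^\top x$ with the restriction of $\|w\| \le W$ and $\|x\| \le X$. Then, we have

$$\begin{aligned}
\mathrm{Rad}_N(\mathcal F, S) &= \mathbb E_\sigma\left[\sup_{w:\|w\| \le W} \frac{1}{N}\sum_{n=1}^N \sigma_n w^\top x_n\right] \\
&= \frac{1}{N}\,\mathbb E_\sigma\left[\sup_{w:\|w\| \le W} w^\top\left(\sum_{n=1}^N \sigma_n x_n\right)\right] \\
&\overset{(a)}{=} \frac{W}{N}\,\mathbb E_\sigma\left\|\sum_{n=1}^N \sigma_n x_n\right\| \\
&\overset{(b)}{\le} \frac{W}{N}\sqrt{\mathbb E_\sigma\left\|\sum_{n=1}^N \sigma_n x_n\right\|^2} = \frac{W}{N}\sqrt{\sum_{n=1}^N \|x_n\|^2} \le \frac{WX}{\sqrt N},
\end{aligned}$$

where (a) comes from the definition of the $\ell_2$ norm, and (b) comes from Jensen's inequality.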
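The chain of inequalities can be checked numerically on a tiny sample by enumerating the sign vectors. The sketch below evaluates the exact quantity $\frac{W}{N}\mathbb E_\sigma\|\sum_n \sigma_n x_n\|$ and compares it with $WX/\sqrt N$; the sample points and the function name are hypothetical.

```python
import math
from itertools import product


def linear_class_rademacher(xs, W):
    """Exact (W/N) * E_sigma || sum_n sigma_n x_n ||_2 by enumeration."""
    n, dim = len(xs), len(xs[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=n):
        # Signed sum of the sample vectors for this sign pattern.
        v = [sum(s * x[k] for s, x in zip(sigma, xs)) for k in range(dim)]
        total += math.sqrt(sum(c * c for c in v))
    return W * total / (n * 2 ** n)


# Four sample points, all with Euclidean norm 1, so X = 1.
xs = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8), (-0.8, 0.6)]
W, X = 2.0, 1.0

exact = linear_class_rademacher(xs, W)
bound = W * X / math.sqrt(len(xs))
print(exact, bound)
```

The exact value stays below the bound, and the gap is the slack introduced by Jensen's inequality in step (b).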


Using the Rademacher complexity, we can now derive a new type of generaliza- 
tion bound. First, we need the following concentration inequality. 


Lemma 12.3 (McDiarmid's Inequality [161]) Let $x_1, \dots, x_N$ be independent random variables taking on values in a set $\mathcal X$ and let $c_1, \dots, c_N$ be positive real constants. If $\varphi : \mathcal X^N \mapsto \mathbb R$ satisfies

$$\sup_{x_1, \dots, x_N,\, x_n' \in \mathcal X} \left|\varphi(x_1, \dots, x_n, \dots, x_N) - \varphi(x_1, \dots, x_n', \dots, x_N)\right| \le c_n,$$

for $1 \le n \le N$, then

$$P\left\{\left|\varphi(x_1, \dots, x_N) - \mathbb E\varphi(x_1, \dots, x_N)\right| \ge \epsilon\right\} \le 2\exp\left(-\frac{2\epsilon^2}{\sum_{n=1}^N c_n^2}\right). \tag{12.25}$$

In particular, if $\varphi(x_1, \dots, x_N) = \sum_{n=1}^N x_n / N$, the inequality (12.25) reduces to Hoeffding's inequality.

Using McDiarmid's inequality and symmetrization using “ghost samples”, we can obtain the following generalization bound.


Theorem 12.3 (Rademacher Bound) Let $S := \{x_n, y_n\}_{n=1}^N$ denote the training set and $f(x) \in [a, b]$. For any $\delta > 0$, with probability at least $1-\delta$, we have

$$R(f) \le \hat{R}_N(f) + 2\,\mathrm{Rad}_N(\mathcal F) + (b-a)\sqrt{\frac{\ln(1/\delta)}{2N}} \tag{12.26}$$

and

$$R(f) \le \hat{R}_N(f) + 2\,\mathrm{Rad}_N(\mathcal F, S) + 3(b-a)\sqrt{\frac{\ln(2/\delta)}{2N}}. \tag{12.27}$$


Unfortunately, many theoretical efforts using the Rademacher complexity to 
understand the deep neural network were not successful [159], which often resulted 
in a vacuous bound similar to the attempts using VC bounds. Therefore, the need to 
obtain a tighter bound has been increasing. 


12.2.3 PAC-Bayes Bounds


So far, we have discussed performance guarantees which hold whenever the training and test data are drawn independently from an identical distribution. In fact, learning algorithms that satisfy such performance guarantees are called probably approximately correct (PAC) learning algorithms [156]. It was shown that a concept class $C$ is PAC learnable if and only if the VC dimension of $C$ is finite [162].

In addition to PAC learning, there is another important area of modern learning theory: Bayesian inference. Bayesian inference applies whenever the training and test data are generated according to the specified prior. However, there is no guarantee in an experimental environment in which training and test data are generated according to a probability distribution different from the prior. In fact, much of modern learning theory can be broken down into Bayesian inference and PAC learning. Both areas investigate learning algorithms that use training data as the input and generate a concept or model as the output, which can then be tested on test data.

The difference between the two approaches can be seen as a trade-off between 
generality and performance. We define an “experimental setting” as a probability 
distribution over training and test data. A PAC performance guarantee applies to a 
wide class of experimental settings. A Bayesian correctness theorem applies only 
to experimental settings that match those previously used in the algorithm. In this 
restricted class of settings, however, the Bayesian learning algorithm can be optimal 
and generally outperforms the PAC learning algorithms. 

The PAC-Bayesian theory combines Bayesian and frequentist approaches [163]. The PAC-Bayesian theory is based on a prior probability distribution concerning the “situation” occurring in nature, and a “rule” expresses a learner’s preference for some rules over others. There is no supposed relationship between the learner’s bias for rules and the nature distribution. This differs from the Bayesian inference, where the starting point is a common distribution of rules and situations, which induces a conditional distribution of rules in certain situations.

Under this set-up, the following PAC-Bayes generalization bound can be obtained.


Theorem 12.4 (PAC-Bayes Generalization Bound) [163] Let $\mathcal D$ be an arbitrary distribution over $z := (x, y) \in \mathcal Z := \mathcal X \times \mathcal Y$. Let $\mathcal F$ be a hypothesis class and let $\ell$ be a loss function such that for all $f$ and $z$ we have $\ell(f, z) \in [0, 1]$. Let $P$ be a prior distribution over $\mathcal F$ and let $\delta \in (0, 1)$. Then, with probability of at least $1-\delta$ over the choice of an i.i.d. training set $S := \{z_n\}_{n=1}^N$ sampled according to $\mathcal D$, for all distributions $Q$ over $\mathcal F$ (even such that depend on $S$), we have

$$\mathbb E_{f \sim Q}\left[R(f)\right] \le \mathbb E_{f \sim Q}\left[\hat{R}_N(f)\right] + \sqrt{\frac{KL(Q\|P) + \ln(N/\delta)}{2(N-1)}}, \tag{12.28}$$

where

$$KL(Q\|P) = \mathbb E_{f \sim Q}\left[\ln\frac{Q(f)}{P(f)}\right] \tag{12.29}$$

is the Kullback-Leibler divergence.
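For a finite hypothesis class the bound can be evaluated directly. The following sketch assumes the slack term has the common form $\sqrt{(KL(Q\|P) + \ln(N/\delta))/(2(N-1))}$ and that $P$ puts positive mass wherever $Q$ does; the function names and the toy numbers are hypothetical.

```python
import math


def kl_divergence(q, p):
    """KL(Q || P) for discrete distributions given as lists of masses."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def pac_bayes_bound(emp_risks, q, p, n, delta):
    """Expected empirical risk under Q plus the assumed PAC-Bayes slack."""
    gibbs_emp = sum(qi * r for qi, r in zip(q, emp_risks))
    slack = math.sqrt(
        (kl_divergence(q, p) + math.log(n / delta)) / (2 * (n - 1))
    )
    return gibbs_emp + slack


emp_risks = [0.10, 0.30, 0.50]   # empirical risks of three hypotheses
prior = [1 / 3, 1 / 3, 1 / 3]    # P: chosen before seeing the data
posterior = [1.0, 0.0, 0.0]      # Q: all mass on the best hypothesis

bound_prior = pac_bayes_bound(emp_risks, prior, prior, n=1000, delta=0.05)
bound_post = pac_bayes_bound(emp_risks, posterior, prior, n=1000, delta=0.05)
print(bound_prior, bound_post)
```

Moving $Q$ away from $P$ costs a KL term but lowers the expected empirical risk; minimizing the right-hand side over $Q$ trades these two effects against each other.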


Recently, PAC-Bayes approaches have been studied extensively to explain the generalization capability of neural networks [149, 153, 164]. According to a recent large scale experiment to test the correlation of different measures with the generalization of deep models [159], the authors confirmed the effectiveness of the PAC-Bayesian bounds and corroborated them as a promising direction for cracking the generalization puzzle. Another nice application of PAC-Bayes bounds is that they provide a means to find the optimal distribution $Q^*$ by minimizing the upper bounds. This technique has been successfully used for the linear classifier design [164], etc.
12.3 Reconciling the Generalization Gap via Double Descent Model


Recall that the following error bound can be obtained for the ERM estimate in (12.2):

$$R(f_{ERM}) \le \underbrace{\hat{R}_N(f_{ERM})}_{\text{empirical risk (training error)}} + \underbrace{O\left(\sqrt{\frac{c}{N}}\right)}_{\text{complexity penalty}}, \tag{12.30}$$

where $O(\cdot)$ denotes the “big O” notation and $c$ refers to the model complexity such as VC dimension, Rademacher complexity, etc.

In (12.30), with increasing hypothesis class size $|\mathcal F|$, the empirical risk or
training error decreases, whereas the complexity penalty increases. The control of 
the functional class capacity can be therefore done explicitly by choosing F (e.g. 
selection of the neural network architecture). This is summarized in the classic U- 
shaped risk curve, which is shown in Fig. 12.3a and was often used as a guide for 
model selection. A widely accepted view from this curve is that a model with zero 
training error is overfitted to the training data and will typically generalize poorly 
[10]. Classical thinking therefore deals with the search for the “sweet spot” between 
underfitting and overfitting. 

Lately, this view has been challenged by empirical results that seem mysterious. 
For example, in [165] the authors trained several standard architectures on a copy 
of the data, with the true labels being replaced by random labels. Their central 
finding can be summarized as follows: deep neural networks easily fit random labels. 
More precisely, neural networks achieve zero training errors if they are trained on 
a completely random labeling of the true data. While this observation is easy to 
formulate, it has profound implications from a statistical learning perspective: the 
effective capacity of neural networks is sufficient to store the entire data set. Despite 
the high capacity of the functional classes and the almost perfect fit to training data, 
these predictors often give very accurate predictions for new data in the test phase. 

These observations rule out VC dimension, Rademacher complexity, etc. from 
describing the generalization behavior. In particular, the Rademacher complexity 
for the interpolation regime, which leads to a training error of 0, assumes the 
maximum value of 1, as previously explained in an example. Therefore, the classic 
generalization bounds are vacuous and cannot explain the amazing generalization 
ability of the neural network. 

The recent breakthrough in Belkin et al.’s “double descent” risk curve [154, 155] reconciles the classic bias-variance trade-off with behaviors that have been observed in over-parameterized regimes for a large number of machine learning models. In particular, when the functional class capacity is below the “interpolation threshold”, learned predictors show the classic U-shaped curve from Fig. 12.3a, where the function class capacity is identified with the number of parameters needed to specify a function within the class. The bottom of the U-shaped risk can be achieved at the sweet spot which balances the fit to the training data and the susceptibility to over-fitting. When we increase the function class capacity high enough by increasing the size of the neural network architecture, the learned predictors achieve (almost) perfect fits to the training data. Although the learned predictors obtained at the interpolation threshold typically have high risk, increasing the function class capacity beyond this point leads to decreasing risk, which typically falls below the risk achieved at the sweet spot in the “classic” regime (see Fig. 12.3b).

Fig. 12.3 Curves for training risk (dashed line) and test risk (solid line). (a) The classical U-shaped risk curve arising from the bias-variance trade-off. (b) The double descent risk curve, which incorporates the U-shaped risk curve (i.e., the “classical” regime) together with the observed behavior from using high-capacity function classes (i.e., the “modern” interpolating regime), separated by the interpolation threshold. The predictors to the right of the interpolation threshold have zero training risk

In the following example we provide concrete and explicit evidence for the 
double descent behavior in the context of simple linear regression models. The 
analysis shows the transition from under- to over-parameterized regimes. It also 
allows us to compare the risks at any point on the curve and explain how the risk in 
the over-parameterized regime can be lower than any risk in the under-parameterized 
regime. 
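Example: Double Descent in Regression [155]

We consider the following linear regression problem:

$$y = x^\top \beta + \epsilon, \tag{12.31}$$

where $\beta \in \mathbb R^D$, and $x$ and $\epsilon$ are a normal random vector and a normal random variable, respectively, with $x \sim \mathcal N(0, I_D)$ and $\epsilon \sim \mathcal N(0, \sigma^2)$. Given training data $\{x_n, y_n\}_{n=1}^N$, we fit a linear model to the data using only a subset $T \subset [D]$ of cardinality $p$, where $[D] := \{1, \dots, D\}$. Let $X = [x_1, \dots, x_N] \in \mathbb R^{D \times N}$ be the design matrix and $y = [y_1, \dots, y_N]^\top$ be the vector of responses. For a subset $T$, we use $\beta_T$ to denote the $|T|$-dimensional subvector of entries of $\beta$ indexed by $T$; we also use $X_T$ to denote the $p \times N$ sub-matrix of $X$ composed of the rows in $T$, and $T^c := [D] \setminus T$ to denote the complement of $T$. Then, the risk of $\hat\beta$, where $\hat\beta_T = (X_T^\top)^\dagger y$ and $\hat\beta_{T^c} = 0$, is given by

$$\mathbb E\left[(y - x^\top\hat\beta)^2\right] = \begin{cases}
\left(\|\beta_{T^c}\|^2 + \sigma^2\right)\left(1 + \dfrac{p}{N-p-1}\right), & \text{if } p \le N-2, \\[2mm]
\infty, & \text{if } N-1 \le p \le N+1, \\[2mm]
\|\beta_T\|^2\left(1 - \dfrac{N}{p}\right) + \left(\|\beta_{T^c}\|^2 + \sigma^2\right)\left(1 + \dfrac{N}{p-N-1}\right), & \text{if } p \ge N+2.
\end{cases} \tag{12.32}$$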

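Proof Recall that $x$ is assumed to follow a Gaussian distribution with zero mean and identity covariance, so that the mean squared prediction error can be written as

$$\mathbb E\left[(y - x^\top\hat\beta)^2\right] = \mathbb E\left[(x^\top\beta + \epsilon - x^\top\hat\beta)^2\right] = \sigma^2 + \mathbb E\|\beta - \hat\beta\|^2 = \sigma^2 + \|\beta_{T^c}\|^2 + \mathbb E\|\hat\beta_T - \beta_T\|^2,$$

where $\beta$ denotes the ground-truth regression parameter and we use the independence of the test phase regressor $x$ and the training phase design matrix $X$. Our goal is now to derive the closed form expression for the last term.

(Classical regime) For the given training data set, with $X_T \in \mathbb R^{p \times N}$, we have

$$\hat\beta_T = (X_T X_T^\top)^{-1} X_T y = (X_T X_T^\top)^{-1} X_T X_T^\top \beta_T + (X_T X_T^\top)^{-1} X_T \eta = \beta_T + (X_T X_T^\top)^{-1} X_T \eta,$$

where

$$\eta := y - X_T^\top \beta_T = \epsilon + X_{T^c}^\top \beta_{T^c},$$

and $\epsilon$ here collects the $N$ training noise terms. By plugging this into the last term, we have

$$\mathbb E\|\hat\beta_T - \beta_T\|^2 = \mathbb E\left[\eta^\top P \eta\right] = \mathrm{Tr}\left(\mathbb E[P]\,\mathbb E\left[\eta\eta^\top\right]\right), \qquad P := X_T^\top (X_T X_T^\top)^{-2} X_T,$$

where we use the independence between $X_T$ and $\eta$. In addition, we have

$$\mathbb E\left[\eta\eta^\top\right] = \mathbb E\left[\epsilon\epsilon^\top\right] + \mathbb E\left[X_{T^c}^\top \beta_{T^c}\left(X_{T^c}^\top \beta_{T^c}\right)^\top\right] = \left(\sigma^2 + \|\beta_{T^c}\|^2\right) I_N.$$

Furthermore, $\mathrm{Tr}(P) = \mathrm{Tr}\left((X_T X_T^\top)^{-1}\right)$, and $(X_T X_T^\top)^{-1}$ follows the inverse-Wishart distribution with identity scale matrix $I_p$ and $N$ degrees of freedom, so that

$$\mathrm{Tr}\left(\mathbb E[P]\right) = \begin{cases} \dfrac{p}{N-p-1}, & \text{if } p \le N-2, \\[2mm] +\infty, & \text{if } p = N-1. \end{cases} \tag{12.33}$$

Therefore, by putting them together we conclude the proof for the classical regime.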
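(Modern interpolating regime) We consider $p \ge N$. Then, we have

$$\hat\beta_T = X_T (X_T^\top X_T)^{-1} y = X_T (X_T^\top X_T)^{-1} X_T^\top \beta_T + X_T (X_T^\top X_T)^{-1} \eta = P_{\mathcal R(X_T)} \beta_T + X_T (X_T^\top X_T)^{-1} \eta,$$

where $\mathcal R(X_T)$ denotes the range space of $X_T$, $P_{\mathcal R(X_T)}$ denotes the projection onto the range space of $X_T$, and

$$\eta := y - X_T^\top \beta_T = \epsilon + X_{T^c}^\top \beta_{T^c}.$$

Therefore,

$$\mathbb E\|\hat\beta_T - \beta_T\|^2 = \mathbb E\left\|P^\perp_{\mathcal R(X_T)} \beta_T\right\|^2 + \mathbb E\left[\eta^\top (X_T^\top X_T)^{-1} \eta\right].$$

Furthermore, we have

$$\mathbb E\left\|P^\perp_{\mathcal R(X_T)} \beta_T\right\|^2 = \left(1 - \frac{N}{p}\right)\|\beta_T\|^2,$$

$$\mathbb E\left[\eta^\top (X_T^\top X_T)^{-1} \eta\right] = \mathrm{Tr}\left(\mathbb E\left[(X_T^\top X_T)^{-1}\right] \mathbb E\left[\eta\eta^\top\right]\right),$$

where we use the independence between $X_T$ and $X_{T^c}$ and $\epsilon$ for the second equality. In addition, we have

$$\mathbb E\left[\eta\eta^\top\right] = \mathbb E\left[\epsilon\epsilon^\top\right] + \mathbb E\left[X_{T^c}^\top \beta_{T^c}\left(X_{T^c}^\top \beta_{T^c}\right)^\top\right] = \left(\sigma^2 + \|\beta_{T^c}\|^2\right) I_N.$$

Finally, the distribution of $P := (X_T^\top X_T)^{-1}$ is inverse-Wishart with identity scale matrix $I_N$ and $p$ degrees of freedom. Accordingly, we have

$$\mathrm{Tr}\left(\mathbb E\left[(X_T^\top X_T)^{-1}\right]\right) = \begin{cases} \dfrac{N}{p-N-1}, & \text{if } p \ge N+2, \\[2mm] +\infty, & \text{if } p = N, N+1. \end{cases}$$

By putting them together, we have

$$\mathbb E\left[(y - x^\top\hat\beta)^2\right] = \left(1 - \frac{N}{p}\right)\|\beta_T\|^2 + \left(\sigma^2 + \|\beta_{T^c}\|^2\right)\left(1 + \frac{N}{p-N-1}\right),$$

for $p \ge N+2$ and $\mathbb E\left[(y - x^\top\hat\beta)^2\right] = \infty$ for $p = N, N+1$. This concludes the proof. □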


Figure 12.4 illustrates an example plot for the linear regression problem analyzed 
above for a particular parameter set. 
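The risk curve can be tabulated directly from (12.32). The sketch below averages over a uniformly random subset $T$ of size $p$, in which case $\mathbb E\|\beta_T\|^2 = \frac{p}{D}\|\beta\|^2$; because (12.32) is linear in $\|\beta_T\|^2$ and $\|\beta_{T^c}\|^2$, the averaged risk is obtained by substituting these means. The ambient dimension $D = 100$ and the function name are assumptions made for illustration; $\|\beta\|^2 = 1$, $\sigma^2 = 1/25$ and $N = 40$ follow the figure.

```python
import math


def risk(p, N, D, beta_sq=1.0, sigma_sq=1.0 / 25):
    """Risk (12.32), averaged over a uniformly random subset T of size p."""
    b_in = beta_sq * p / D     # E ||beta_T||^2
    b_out = beta_sq - b_in     # E ||beta_{T^c}||^2
    if p <= N - 2:             # classical regime
        return (b_out + sigma_sq) * (1 + p / (N - p - 1))
    if p >= N + 2:             # modern interpolating regime
        return b_in * (1 - N / p) + (b_out + sigma_sq) * (1 + N / (p - N - 1))
    return math.inf            # around the interpolation threshold


N, D = 40, 100                 # D = 100 is an assumed ambient dimension
curve = [risk(p, N, D) for p in range(D + 1)]

best_classical = min(curve[: N - 1])   # p = 0, ..., N - 2
best_modern = min(curve[N + 2:])       # p = N + 2, ..., D
print(best_classical, best_modern)
```

With these settings the smallest risk in the over-parameterized regime is lower than the smallest risk in the under-parameterized regime, which is the second descent.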


12.4 Inductive Bias of Optimization 


All learned predictors to the right of the interpolation threshold fit perfectly with the training data and have no empirical risk. Then, why should some (especially those from larger functional classes) have a lower test risk than others so that they generalize better? The answer is that the functional class capacity, such as VC dimension or Rademacher complexity, does not necessarily reflect the inductive bias of the predictor appropriate for the problem at hand. Indeed, one of the underlying reasons for the appearance of the double descent model in the previous linear regression problem is that we impose an inductive bias to choose the minimum norm solution $\hat\beta_T = X_T (X_T^\top X_T)^{-1} y$ for the over-parameterized regime, which leads to the smooth solution.

Fig. 12.4 Plot of the risk in (12.32) as a function of $p$ under the random selection of $T$. Here $\|\beta\|^2 = 1$, $\sigma^2 = 1/25$ and $N = 40$

Among the various interpolation solutions, choosing the smooth or simple function that perfectly fits the observed data is a form of Occam’s razor: the simplest explanation compatible with the observations should be preferred. By considering larger functional classes that contain more candidate predictors that are compatible with the data, we can find interpolation functions that are “simpler”. One of the important advantages of choosing a simpler solution is that it generalizes more easily, since it avoids fitting unnecessary glitches in the data. Increasing the functional class capacity to the over-parameterized area thus improves the performance of the resulting classifiers.

Then, one of the remaining questions is: what is the underlying mechanism by which a trained network becomes smooth or simple? This is closely related to the inductive bias (or implicit bias) of an optimization algorithm such as gradient descent, stochastic gradient descent (SGD), etc. [166-171]. Indeed, this is an active area of research. For example, the authors in [168] show that gradient descent for a linear classifier with a specific loss function leads to the maximum margin SVM classifier. Other researchers have shown that gradient descent in deep neural network training leads to a simple solution [169-171].
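A toy case makes the implicit bias concrete. For an under-determined least-squares problem, gradient descent started from zero is known to converge to the minimum norm interpolating solution, because every update stays in the span of the data. The sketch below uses one equation with two unknowns; it is an illustration only, not the setting analyzed in [168-171], and all names and numbers are hypothetical.

```python
def gradient_descent_min_norm(a, y, lr=0.1, steps=200):
    """Minimize 0.5 * (a . w - y)^2 by gradient descent from w = 0."""
    w = [0.0] * len(a)  # zero initialization keeps w in span{a}
    for _ in range(steps):
        residual = sum(ai * wi for ai, wi in zip(a, w)) - y
        w = [wi - lr * residual * ai for wi, ai in zip(w, a)]
    return w


a, y = [1.0, 2.0], 5.0  # one equation, two unknowns: w1 + 2 * w2 = 5
w_gd = gradient_descent_min_norm(a, y)

# Minimum norm interpolant a (a^T a)^{-1} y, for comparison.
norm_sq = sum(ai * ai for ai in a)
w_min_norm = [ai * y / norm_sq for ai in a]

# Another interpolating solution with a larger norm.
w_other = [5.0, 0.0]
print(w_gd, w_min_norm, w_other)
```

Both `w_min_norm` and `w_other` fit the single data point exactly, yet gradient descent selects the former without any explicit regularization.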


12.5 Generalization Bounds via Algorithm Robustness 


Another important question is how we can quantify the inductive bias of the algorithm in terms of a generalization error bound. In this section, we introduce a notion of algorithmic robustness for quantifying the generalization error, which was originally proposed in [172], but has been largely neglected in deep learning research. It turns out that the generalization bound based on algorithmic robustness has all the ingredients to quantify the fascinating generalization behavior of the deep neural network, so it can be a useful tool for studying generalization.

Recall that the underlying assumption for the classical generalization bounds is 
the uniform convergence of empirical quantities to their mean [10], which provides 
ways to bound the gap between the expected risk and the empirical risk by the 
complexity of the hypothesis set. On the other hand, robustness requires that a 
prediction rule has comparable performance if tested on a sample close to a training 
sample. This is formally defined as follows. 


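Definition 12.5 (Algorithm Robustness [172]) Algorithm $\mathcal A$ is said to be $(K, \epsilon(\cdot))$-robust for $K \in \mathbb N$ and $\epsilon(\cdot) : \mathcal Z^N \mapsto \mathbb R$, if $\mathcal Z := \mathcal X \times \mathcal Y$ can be partitioned into $K$ disjoint sets, denoted by $\{C_i\}_{i=1}^K$, such that the following holds for all training sets $S \subset \mathcal Z$:

$$\forall s \in S,\ \forall z \in \mathcal Z;\ \text{if } s, z \in C_i, \text{ then } |\ell(\mathcal A_S, s) - \ell(\mathcal A_S, z)| \le \epsilon(S) \tag{12.34}$$

for all $i = 1, \dots, K$, where $\mathcal A_S$ denotes the algorithm $\mathcal A$ trained with the data set $S$.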


Then, we can obtain the generalization bound based on algorithmic robustness. 
First, we need the following concentration inequality. 


Lemma 12.4 (Bretagnolle-Huber-Carol Inequality [173]) If the random vector $(N_1, \dots, N_K)$ is multinomially distributed with parameters $N$ and $(p_1, \dots, p_K)$, then

$$P\left\{\sum_{i=1}^K |N_i - N p_i| \ge 2\sqrt{N}\lambda\right\} \le 2^K \exp(-2\lambda^2), \quad \lambda > 0. \tag{12.35}$$


Theorem 12.5 If a learning algorithm $\mathcal A$ is $(K, \epsilon(\cdot))$-robust, and the training sample set $S$ is generated by $N$ i.i.d. samples from the probability measure $\mu$, then for any $\delta > 0$, with probability at least $1-\delta$ we have

$$|R(\mathcal A_S) - \hat{R}_N(\mathcal A_S)| \le \epsilon(S) + M\sqrt{\frac{2K\ln 2 + 2\ln(1/\delta)}{N}}, \tag{12.36}$$

where

$$M := \max_{z \in \mathcal Z} |\ell(\mathcal A_S, z)|.$$
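Proof Let $N_i$ be the set of indices of points of $S$ that fall into $C_i$. Note that $(|N_1|, \dots, |N_K|)$ is an i.i.d. multinomial random variable with parameters $N$ and $(\mu(C_1), \dots, \mu(C_K))$. Then, the following holds by Lemma 12.4:

$$P\left\{\sum_{i=1}^K \left|\frac{|N_i|}{N} - \mu(C_i)\right| \ge \lambda\right\} \le 2^K \exp\left(-\frac{N\lambda^2}{2}\right). \tag{12.37}$$

Hence, the following holds with probability at least $1-\delta$:

$$\sum_{i=1}^K \left|\frac{|N_i|}{N} - \mu(C_i)\right| \le \sqrt{\frac{2K\ln 2 + 2\ln(1/\delta)}{N}}. \tag{12.38}$$

The generalization error is then given by

$$\begin{aligned}
|R(\mathcal A_S) - \hat{R}_N(\mathcal A_S)| &= \left|\sum_{i=1}^K \mathbb E_{z \sim \mu}\left[\ell(\mathcal A_S, z) \mid z \in C_i\right]\mu(C_i) - \frac{1}{N}\sum_{n=1}^N \ell(\mathcal A_S, s_n)\right| \\
&\overset{(a)}{\le} \left|\sum_{i=1}^K \mathbb E_{z \sim \mu}\left[\ell(\mathcal A_S, z) \mid z \in C_i\right]\frac{|N_i|}{N} - \frac{1}{N}\sum_{n=1}^N \ell(\mathcal A_S, s_n)\right| \\
&\quad + \left|\sum_{i=1}^K \mathbb E_{z \sim \mu}\left[\ell(\mathcal A_S, z) \mid z \in C_i\right]\mu(C_i) - \sum_{i=1}^K \mathbb E_{z \sim \mu}\left[\ell(\mathcal A_S, z) \mid z \in C_i\right]\frac{|N_i|}{N}\right| \\
&\overset{(b)}{\le} \frac{1}{N}\sum_{i=1}^K \sum_{j \in N_i} \max_{z_2 \in C_i} |\ell(\mathcal A_S, s_j) - \ell(\mathcal A_S, z_2)| + \max_{z \in \mathcal Z} |\ell(\mathcal A_S, z)| \sum_{i=1}^K \left|\frac{|N_i|}{N} - \mu(C_i)\right| \\
&\overset{(c)}{\le} \epsilon(S) + M \sum_{i=1}^K \left|\frac{|N_i|}{N} - \mu(C_i)\right| \\
&\le \epsilon(S) + M\sqrt{\frac{2K\ln 2 + 2\ln(1/\delta)}{N}},
\end{aligned}$$

where (a), (b), and (c) are due to the triangle inequality, the definition of $N_i$, and the definition of $\epsilon(S)$ and $M$, respectively, and the last inequality follows from (12.38). □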
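To see how the number of partitions $K$ enters the bound, the right-hand side of (12.36) can be evaluated for a few values. This is a small sketch with hypothetical numbers for $\epsilon(S)$, $M$, $N$ and $\delta$; only the formula itself comes from the theorem.

```python
import math


def robustness_bound(eps, M, K, N, delta):
    """Right-hand side of (12.36)."""
    return eps + M * math.sqrt(
        (2 * K * math.log(2) + 2 * math.log(1 / delta)) / N
    )


# Hypothetical setting: loss bounded by M = 1, eps(S) = 0.05, N = 10^4 samples.
for K in (4, 40, 400, 4000):
    print(K, robustness_bound(eps=0.05, M=1.0, K=K, N=10_000, delta=0.05))
```

The bound depends on the partition count and on $\epsilon(S)$, not on the number of parameters, and it becomes loose once $K$ is comparable to $N$.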


Note that the definition of robustness requires that (12.34) holds for every training sample. The parameters $K$ and $\epsilon(\cdot)$ quantify the robustness of an algorithm. Since $\epsilon(\cdot)$ is a function of training samples, an algorithm can have different robustness properties for different training patterns. For example, a classification algorithm is more robust to a training set with a larger margin. Since (12.34) includes both the trained solution $\mathcal A_S$ and the training set $S$, robustness is a property of the learning algorithm, rather than the property of the “effective hypothesis space”. This is why the robustness-based generalization bound can account for the inductive bias from the algorithm.

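For example, for the case of a single-layer ReLU neural network $f_\Theta : \mathbb R^2 \to \mathbb R^2$ with the following weight matrix and bias:

$$W^{(0)} = \begin{bmatrix} 2 & -1 \\ 1 & 1 \end{bmatrix}, \quad b^{(0)} = \begin{bmatrix} 1 \\ -1 \end{bmatrix},$$

the corresponding neural network output for the input $[x, y]^\top$ is given by

$$o^{(1)} = \begin{cases}
[0,\ 0]^\top, & 2x - y + 1 < 0,\ x + y - 1 < 0, \\
[2x - y + 1,\ 0]^\top, & 2x - y + 1 \ge 0,\ x + y - 1 < 0, \\
[0,\ x + y - 1]^\top, & 2x - y + 1 < 0,\ x + y - 1 \ge 0, \\
[2x - y + 1,\ x + y - 1]^\top, & 2x - y + 1 \ge 0,\ x + y - 1 \ge 0.
\end{cases}$$

Here, the number of partitions is $K = 4$.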
On the other hand, consider a two-layer ReLU network with the weight matrices and biases given by

$$W^{(0)} = \begin{bmatrix} 2 & -1 \\ 1 & 1 \end{bmatrix}, \quad b^{(0)} = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \quad W^{(1)} = \begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix}, \quad b^{(1)} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}.$$

The corresponding neural network output is given by

$$o^{(2)} = \begin{cases}
[0,\ 1]^\top, & 2x - y + 1 < 0,\ x + y - 1 < 0, \\
[2x - y + 1,\ 1]^\top, & 2x - y + 1 \ge 0,\ x + y - 1 < 0, \\
[2x + 2y - 2,\ x + y]^\top, & 2x - y + 1 < 0,\ x + y - 1 \ge 0, \\
[4x + y - 1,\ x + y]^\top, & 2x - y + 1 \ge 0,\ x + y - 1 \ge 0.
\end{cases}$$

Therefore, in spite of the twice larger parameter sizes, the number of partitions is $K = 4$, which is the same as the single-layer neural network. Therefore, in terms of the generalization bounds, the two algorithms have the same upper bound up to the parameter $\epsilon(S)$. This example clearly confirms that generalization is a property of the learning algorithm, rather than the property of the effective hypothesis space or the number of parameters.
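The partition counts can be checked numerically by recording which ReLU units are active on a grid of inputs. The sketch below assumes the weights $W^{(0)} = [[2, -1], [1, 1]]$, $b^{(0)} = [1, -1]^\top$, $W^{(1)} = [[1, 2], [0, 1]]$, $b^{(1)} = [0, 1]^\top$, treats a pre-activation of exactly zero as active, and applies a ReLU after the second layer as well (it never switches off for these weights). The helper names and the grid are hypothetical.

```python
def affine(W, b, v):
    """Compute W v + b for a small dense layer."""
    return [sum(w * t for w, t in zip(row, v)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, t) for t in v]


def activation_patterns(layers, grid):
    """Set of distinct on/off patterns of all ReLU units over the grid."""
    seen = set()
    for point in grid:
        h, pattern = list(point), []
        for W, b in layers:
            pre = affine(W, b, h)
            pattern.append(tuple(t >= 0 for t in pre))  # >= 0 counts as active
            h = relu(pre)
        seen.add(tuple(pattern))
    return seen


W0, b0 = [[2, -1], [1, 1]], [1, -1]
W1, b1 = [[1, 2], [0, 1]], [0, 1]

# Grid over [-5, 5]^2 with step 0.25 (exactly representable in floating point).
grid = [(i / 4, j / 4) for i in range(-20, 21) for j in range(-20, 21)]

k_single = len(activation_patterns([(W0, b0)], grid))
k_double = len(activation_patterns([(W0, b0), (W1, b1)], grid))
print(k_single, k_double)
```

Both networks produce four linear regions on this grid: the second layer doubles the parameter count without refining the partition.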
1. Compute the VC dimension of the following function classes:

(a) Interval $[a, b]$.

(b) Disc in $\mathbb R^2$.

(c) Half space in $\mathbb R^d$.

(d) Axis-aligned rectangles.

2. Show that the classifier $f_\theta$ that returns 1 if the input number $x$ is larger than $\sin(\theta x)$ and 0 otherwise can shatter any finite subset of the set $\{2^{-m} \mid m \in \mathbb N\}$.
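3. Prove the following properties of Rademacher complexity:

(a) (Monotonicity) If $\mathcal F \subset \mathcal G$, then $\mathrm{Rad}_N(\mathcal F) \le \mathrm{Rad}_N(\mathcal G)$.

(b) (Convex hull) Let $\mathrm{conv}(\mathcal F)$ be the convex hull of $\mathcal F$. Then $\mathrm{Rad}_N(\mathcal F) = \mathrm{Rad}_N(\mathrm{conv}(\mathcal F))$.

(c) (Scale and shift) For any function class $\mathcal F$ and $c, d \in \mathbb R$, $\mathrm{Rad}_N(c\mathcal F + d) = |c|\,\mathrm{Rad}_N(\mathcal F)$.

(d) (Lipschitz composition) If $\phi$ is an $L$-Lipschitz function, then $\mathrm{Rad}_N(\phi \circ \mathcal F) \le L \cdot \mathrm{Rad}_N(\mathcal F)$.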


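4. Let $\mathcal F$ be the class of linear predictors given by $y = w^\top x$ with the restriction of $\|w\|_1 \le W_1$ and $\|x\|_\infty \le X_\infty$ for $x \in \mathbb R^d$. Then, show that

$$\mathrm{Rad}_N(\mathcal F) \le \frac{W_1 X_\infty \sqrt{2\ln(d)}}{\sqrt N}.$$

5. Let $A$ be a set of $N$ vectors in $\mathbb R^m$, and let $\bar a$ be the mean of the vectors in $A$. Then:

$$\mathrm{Rad}_N(A) \le \max_{a \in A} \|a - \bar a\|_2 \cdot \frac{\sqrt{2\log N}}{m}.$$

In particular, if $A$ is a set of binary vectors,

$$\mathrm{Rad}_N(A) \le \sqrt{\frac{2\log N}{m}}.$$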
m 


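6. For a metric space $(\mathcal S, \rho)$ and $\mathcal T \subset \mathcal S$ we say that $\hat{\mathcal T} \subset \mathcal S$ is an $\epsilon$-cover of $\mathcal T$, if for all $t \in \mathcal T$, there exists $t' \in \hat{\mathcal T}$ such that $\rho(t, t') \le \epsilon$. The $\epsilon$-covering number of $\mathcal T$ is defined by

$$N(\epsilon, \mathcal T, \rho) = \min\{|\mathcal T'| : \mathcal T' \text{ is an } \epsilon\text{-cover of } \mathcal T\}.$$

If $\mathcal Z$ is compact w.r.t. metric $\rho$, and $\ell(\mathcal A_S, \cdot)$ is Lipschitz continuous with Lipschitz constant $c(S)$, i.e.,

$$|\ell(\mathcal A_S, z_1) - \ell(\mathcal A_S, z_2)| \le c(S)\rho(z_1, z_2), \quad \forall z_1, z_2 \in \mathcal Z,$$

then show that $\mathcal A$ is $(K, \epsilon(S))$-robust, where

$$K = N(\gamma/2, \mathcal Z, \rho), \quad \epsilon(S) = c(S)\gamma$$

for $\gamma > 0$.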


Chapter 13
Generative Models and Unsupervised Learning


13.1 Introduction 


The last part of our voyage toward the understanding of the geometry of deep learn- 
ing concerns perhaps the most exciting aspect of deep learning—generative models. 
Generative models cover a large spectrum of research activities, which include the 
variational autoencoder (VAE) [174, 175], generative adversarial network (GAN) 
[88, 176, 177], normalizing flow [178-181], optimal transport (OT) [182-184], etc. 
This field has evolved very quickly, and at any machine learning conference like 
NeurIPS, CVPR, ICML, ICLR, etc., you may have seen exciting new developments 
that far surpass existing approaches. In fact, this may be one of the excuses why the 
writing of this chapter has been deferred till the last minute, since there could be 
new updates during the writing. 

For example, Fig. 13.1 shows the examples of fake human faces generated by 
various generative models starting from the GAN[88] in 2014 to styleGAN[89] 
in 2018. You may be amazed to see how the images become so realistic with so 
much detail within such a short time period. In fact, this may be another reason why 
DeepFake by generative models has become a societal problem in the modern deep 
learning era. 

Besides creating fake faces, another reason that a generative model is so 
important is that it is a systematic means of designing unsupervised learning 
algorithms. For example, in Yann LeCun’s famous cake analogy at NeurIPS 2016, 
he emphasized the importance of unsupervised learning by saying “If intelligence 
is a cake, the bulk of the cake is unsupervised learning, the icing on the cake is 
supervised learning, and the cherry on the cake is reinforcement learning (RL).” 
Referring to the GAN, Yann LeCun said that it was “the most interesting idea in the 
last 10 years in machine learning,” and predicted that it may become one of the most 
important engines for modern unsupervised learning. 

Despite their popularity, one of the reasons generative models are difficult to understand is that there are so many variations, such as the VAE [174], β-VAE [175], GAN [88], f-GAN [176], W-GAN [177], normalizing flow [178-180], GLOW [181], optimal transport [182-184], cycleGAN [185], starGAN [87], CollaGAN [186], to name just a few. Moreover, the modern deep generative models, in particular GANs, have been characterized by the public media as magical black boxes which can generate anything from nothing. Therefore, one of the main goals of this chapter is to demystify generative models by providing a coherent geometric picture of them.

Fig. 13.2 Geometry of generative models

Specifically, our unified geometric view starts from Fig. 13.2. Here, the ambient image space is $\mathcal X$, where we can take samples with the real data distribution $\mu$. If the latent space is $\mathcal Z$, the generator $G$ can be treated as a mapping from the latent space to the ambient space, $G : \mathcal Z \mapsto \mathcal X$, often realized by a deep network with parameter $\theta$, i.e. $G := G_\theta$. Let $\zeta$ be a fixed distribution on the latent space, such as uniform or Gaussian distribution. The generator $G_\theta$ pushes forward $\zeta$ to a distribution $\mu_\theta = G_{\theta\#}\zeta$ in the ambient space $\mathcal X$ (don’t worry about the term “push-forward” at this point, as it will be explained later). Then, the goal of the generative model training is to make $\mu_\theta$ as close as possible to the real data distribution $\mu$. Additionally, for the case of autoencoding generative model, the generator works as a decoder, and there exists an additional encoder. More specifically, an encoder $F$ maps from the sample space to the latent space $F : \mathcal X \mapsto \mathcal Z$, parameterized by $\phi$, i.e. $F = F_\phi$, so that the encoder pushes forward $\mu$ to a distribution $\zeta_\phi = F_{\phi\#}\mu$ in the latent space. Accordingly, the additional constraint is again to minimize the distance between $\zeta_\phi$ and $\zeta$.

Using this unified geometric model, we can show that various types of generative models such as VAE, β-VAE, GAN, OT, normalizing flow, etc. only differ in their choices of distances between $\mu_\theta$ and $\mu$ or between $\zeta_\phi$ and $\zeta$, and how to train the generator and encoder to minimize the distances.

Therefore, this chapter is structured somewhat differently from the conventional 
approaches to describing generative models. Rather than directly diving into specific 
details of each generative model, here we try to first provide a unified theoretical 
view, and then derive each generative model as a special case. Specifically, we 
first provide a brief review of probability theory, statistical distances, and optimal 
transport theory [182, 184]. Using these tools, we discuss in detail how each specific 
algorithm can be derived by simply changing the choice of statistical distance. 


13.2 Mathematical Preliminaries


In this section, we assume that the readers are familiar with basic probability and 
measure theory [2]. For more background on the formal definition of probability 
space and related terms from the measure theory, see Chap. 1. 


Definition 13.1 (Push-Forward of a Measure) Let $(X, \mathcal F, \mu)$ be a probability space, let $Y$ be a set, and let $f : X \mapsto Y$ be a function. The push-forward of $\mu$ by $f$ is the probability measure $\nu : f(\mathcal F) \mapsto [0, 1]$ defined by

$$\nu(S) = \mu(f^{-1}(S)), \tag{13.1}$$

which is often denoted by $\nu = f_\#\mu$.

As an important example, a random variable $X : \Omega \mapsto M$ from a set of possible
outcomes $\Omega$ to a measurable space $M$ can be regarded as a push-forward of a
measure. More specifically, on a probability space $(\Omega, \mathcal{F}, \mu)$, the probability
$\nu$ that the random variable $X$ takes a value in a set $S \subset M$ is written as

$$
\begin{aligned}
\nu(S) &:= \mu(\{X \in S\}) \\
&= \mu(\{\omega \in \Omega \mid X(\omega) \in S\}) \\
&= \mu\left(X^{-1}(S)\right).
\end{aligned} \tag{13.2}
$$

Accordingly, we can regard the random variable $X$ as pushing forward the measure
$\mu$ on $\Omega$ to a measure $\nu$ on $M$.

Example (Push-Forward Measure)
Consider Example 1.4. We now introduce a real-valued random variable:

$$
X(\omega) = \begin{cases} 1, & \text{if } \omega = H, \\ 0, & \text{if } \omega = T. \end{cases}
$$

Then, the push-forward measure $Q = X_\# P$ is given by

$$
Q(\emptyset) = 0, \quad Q(\{1\}) = 0.5, \quad Q(\{0\}) = 0.5, \quad Q(\{0, 1\}) = 1.
$$
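
The push-forward in this example can be reproduced with a few lines of Python. This is only a minimal sketch: it assumes that Example 1.4 is the fair coin toss with outcomes H and T, and the helper names `push_forward` and `measure_of` are hypothetical, introduced here for illustration.

```python
from fractions import Fraction


def push_forward(measure, f):
    """Push a discrete measure (dict: outcome -> mass) forward through f."""
    out = {}
    for outcome, mass in measure.items():
        out[f(outcome)] = out.get(f(outcome), 0) + mass
    return out


def measure_of(measure, subset):
    """Mass that a discrete measure assigns to a set of points."""
    return sum(mass for point, mass in measure.items() if point in subset)


# Assumed fair coin on the outcome space {H, T}.
P = {"H": Fraction(1, 2), "T": Fraction(1, 2)}


def X(w):
    # The random variable of the example: H -> 1, T -> 0.
    return 1 if w == "H" else 0


Q = push_forward(P, X)  # Q = X_# P, a measure on {0, 1}
```

Evaluating `measure_of(Q, S)` for the four subsets of {0, 1} returns the four values listed in the example.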


We now define the Radon—Nikodym derivative, which is a mathematical tool 
to derive the probability density function (pdf) for the continuous domain, or 
probability mass function (pmf) for the discrete domain in a rigorous setting. This 
is also important in deriving the statistical distances, in particular, the divergences. 
For this, we need to understand the concept of an absolutely continuous measure. 


Definition 13.2 (Absolutely Continuous Measure) If $\mu$ and $\nu$ are two measures
on any event set $\mathcal{F}$ of $\Omega$, we say that $\nu$ is absolutely continuous with respect to $\mu$, or
$\nu \ll \mu$, if for every measurable set $A$, $\mu(A) = 0$ implies $\nu(A) = 0$.


Figure 13.3a shows a case in which $\nu$ is not absolutely continuous with respect
to $\mu$, whereas Fig. 13.3b corresponds to a case where $\nu \ll \mu$. Besides being
a prerequisite for the existence of a Radon-Nikodym derivative, absolute
continuity is important because it determines whether the use of a particular divergence
is appropriate in designing a specific generative model.


Theorem 13.1 (Radon-Nikodym Theorem) Let $\lambda$ and $\nu$ be two measures on any
event set $\mathcal{F}$ of $\Omega$. If $\lambda \ll \nu$, then there exists a non-negative function $g$ on $\Omega$ such
that

$$
\lambda(A) = \int_A d\lambda = \int_A g \, d\nu, \quad A \in \mathcal{F}. \tag{13.3}
$$

The function $g$ is called the Radon-Nikodym derivative or density of $\lambda$ w.r.t. $\nu$ and
is denoted by $d\lambda/d\nu$. One of the popular Radon-Nikodym derivatives in probability
theory is the probability density function (pdf) or probability mass function (pmf)
as discussed below. The Radon-Nikodym derivative is also a key to defining an
$f$-divergence as a statistical distance measure.

Fig. 13.3 (a) $\nu$ is not absolutely continuous w.r.t. $\mu$. (b) $\nu \ll \mu$


Example (Radon-Nikodym Derivative for Discrete Probability Measure)
Let $a_1 < a_2 < \cdots$ be a sequence of real numbers and let $p_n$, $n = 1, 2, \cdots$,
be a sequence of positive numbers such that $\sum_{n=1}^{\infty} p_n = 1$. Then,

$$
F(x) = \begin{cases} \sum_{i=1}^{n} p_i, & a_n \le x < a_{n+1}, \; n = 1, 2, \cdots \\ 0, & -\infty < x < a_1 \end{cases} \tag{13.4}
$$

This is often called the discrete cumulative distribution function (cdf), and for
this discrete case, it increases stepwise. Then, the corresponding probability
measure is

$$
P(A) = \sum_{i : a_i \in A} p_i. \tag{13.5}
$$

Let $\nu$ be the counting measure. Then,

$$
P(A) = \int_A f \, d\nu = \sum_{a_i \in A} f(a_i). \tag{13.6}
$$

By inspection of (13.5) and (13.6), we can see that the Radon-Nikodym
derivative is given by

$$
f(a_i) = p_i, \quad i = 1, 2, \cdots, \tag{13.7}
$$

which is often called the probability mass function (pmf).
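
To make the discrete case concrete, the following sketch computes a Radon-Nikodym derivative between two measures on a finite set and checks absolute continuity first. The function name `radon_nikodym` and the example numbers are assumptions made for illustration; they do not come from the text.

```python
from fractions import Fraction


def radon_nikodym(lam, nu):
    """Density g = d(lam)/d(nu) for measures on a finite set (dicts).

    Raises ValueError when lam is not absolutely continuous w.r.t. nu,
    i.e. when lam puts mass on a point that nu ignores.
    """
    for point, mass in lam.items():
        if mass > 0 and nu.get(point, 0) == 0:
            raise ValueError(f"lam charges {point!r} but nu does not")
    return {point: Fraction(lam.get(point, 0)) / nu[point]
            for point in nu if nu[point] > 0}


# Hypothetical discrete distribution on the points a_i in {1, 2, 5}.
P = {1: Fraction(1, 5), 2: Fraction(3, 10), 5: Fraction(1, 2)}
counting = {a: 1 for a in P}  # counting measure: unit mass per point

pmf = radon_nikodym(P, counting)  # dP/d(counting) is the pmf

# P(A) recovered by integrating the density against the counting measure.
A = {2, 5}
prob_A = sum(pmf[a] * counting[a] for a in A)
```

With the counting measure as reference, the returned density coincides with the pmf, in line with (13.7).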


Example (Radon-Nikodym Derivative for Continuous Probability Measure)

Recall that the continuous domain cumulative distribution function (cdf) $F$ is
given by

$$
F(x) = \int_{-\infty}^{x} f(y) \, dy, \quad x \in \mathbb{R}, \tag{13.8}
$$

where $f(y)$ is the probability density function (pdf). Then, the corresponding
probability of belonging to an interval $A$ can be computed by

$$
P(A) = \int_A f(y) \, dy \tag{13.9}
$$

for any interval $A$. Therefore, we can easily see that the pdf $f$ is the
Radon-Nikodym derivative with respect to the Lebesgue measure.


Although the Radon—Nikodym derivative is used to derive the pdf and pmf, it 
is a more general concept often used for any integral operation with respect to a 
measure. The following proposition is quite helpful for evaluating integrals with 
respect to a push-forward measure. 


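Proposition 13.1 (Change-of-Variable Formula) Let $(X, \mathcal{F}, \mu)$ be a probability
space, and let $f : X \mapsto Y$ be a function, such that a push-forward measure $\nu$ is
defined by $\nu = f_\#\mu$. Then, for any function $g$ on $Y$ that is integrable with respect to $\nu$, we have

$$
\int_Y g \, d\nu = \int_X g \circ f \, d\mu, \tag{13.10}
$$

where $\circ$ denotes function composition.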
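
The change-of-variable formula can be checked exactly on a finite example. The sketch below uses a fair die for $\mu$ and arbitrary illustrative choices of $f$ and $g$; none of these specifics come from the text.

```python
from fractions import Fraction

# mu: uniform measure on the faces of a die (an assumed toy example).
mu = {x: Fraction(1, 6) for x in range(1, 7)}


def f(x):
    return x % 3  # map X = {1..6} to Y = {0, 1, 2}


def g(y):
    return y * y  # test function on Y


# Push-forward nu = f_# mu.
nu = {}
for x, mass in mu.items():
    nu[f(x)] = nu.get(f(x), Fraction(0)) + mass

lhs = sum(g(y) * mass for y, mass in nu.items())      # integral of g d(nu) over Y
rhs = sum(g(f(x)) * mass for x, mass in mu.items())   # integral of (g o f) d(mu) over X
```

Both sides evaluate to the same rational number, as (13.10) requires.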


13.3 Statistical Distances 


As discussed before, the distance in the probability space is one of the key concepts 
for understanding the generative models. In statistics, a statistical distance quantifies 
the distance between two statistical objects, which can be two random variables, or 
two probability distributions or samples. The distance can be between an individual 
sample point and a population or a wider sample of points. 


13.3.1 f-Divergence 


Defining a metric in the probability space is often complicated, if not impossible. 
Therefore, relaxed forms of the metric are often used. For example, the statistical 
distances that satisfy 1) and 2) of Definition 1.1 are referred to as divergences, and 
are quite often used in statistics and machine learning. One of the most widely 
used forms of divergence in machine learning is f-divergence, which is defined 
as follows. 

Definition 13.3 (f-Divergence) Let $\mu$ and $\nu$ be two probability distributions over
a space $\Omega$ such that $\mu \ll \nu$. Then, for a convex function $f$ such that $f(1) = 0$, the
$f$-divergence of $\mu$ from $\nu$ is defined as

$$
D_f(\mu \| \nu) = \int_\Omega f\left(\frac{d\mu}{d\nu}\right) d\nu, \tag{13.11}
$$

where $d\mu/d\nu$ is the Radon-Nikodym derivative w.r.t. $\nu$. If $\mu \ll \xi$ and $\nu \ll \xi$ for a
common measure $\xi$ on $\Omega$, then their probability densities $p$ and $q$ satisfy $d\mu = p \, d\xi$
and $d\nu = q \, d\xi$. In this case the $f$-divergence can be written as

$$
D_f(P \| Q) = \int_\Omega f\left(\frac{p(x)}{q(x)}\right) q(x) \, d\xi(x). \tag{13.12}
$$


One condition that must be treated carefully is $\mu \ll \nu$. For example, if $\mu$ is
the measure of the original data and $\nu$ is the distribution
of the generated data, their absolute continuity w.r.t. each other should be checked
first in order to choose the right form of divergence.

For the discrete case, where $P(x)$ and $Q(x)$ are the respective probability
mass functions, the $f$-divergence can be written as

$$
D_f(P \| Q) = \sum_x Q(x) f\left(\frac{P(x)}{Q(x)}\right). \tag{13.13}
$$


Depending on the choice of the convex function f, we can obtain various special 
cases. Some of the representative special cases are as follows. 


13.3.1.1 Kullback-Leibler (KL) Divergence

The corresponding generator $f$ is given by

$$
f(t) = t \log t.
$$

In the discrete case, KL divergence can be represented by

$$
\begin{aligned}
D_{KL}(P \| Q) &= \sum_x Q(x) \frac{P(x)}{Q(x)} \log \frac{P(x)}{Q(x)} \\
&= \sum_x P(x) \log \frac{P(x)}{Q(x)} \\
&= -\sum_x \left( P(x) \log Q(x) - P(x) \log P(x) \right) \\
&= H(P, Q) - H(P),
\end{aligned} \tag{13.14}
$$

where $H(P, Q)$ is the cross entropy of $P$ and $Q$, and $H(P)$ is the entropy of $P$:

$$
H(P) = -\sum_x P(x) \log P(x), \tag{13.15}
$$

$$
H(P, Q) = -\sum_x P(x) \log Q(x). \tag{13.16}
$$

Therefore, KL divergence is often called the relative entropy. 
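
The identity in (13.14) is easy to verify numerically. The sketch below evaluates the discrete $f$-divergence of (13.13) with the KL generator and compares it with cross entropy minus entropy. The function names and the two example distributions are illustrative assumptions, and the code presumes $Q(x) > 0$ everywhere.

```python
import math


def f_divergence(P, Q, f):
    """Discrete f-divergence of (13.13); assumes Q(x) > 0 for every x."""
    return sum(q * f(p / q) for p, q in zip(P, Q))


def kl_generator(t):
    # f(t) = t log t, extended by continuity with f(0) = 0.
    return t * math.log(t) if t > 0 else 0.0


def entropy(P):
    return -sum(p * math.log(p) for p in P if p > 0)


def cross_entropy(P, Q):
    return -sum(p * math.log(q) for p, q in zip(P, Q) if p > 0)


# Hypothetical pmfs on a three-point space.
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

kl = f_divergence(P, Q, kl_generator)
gap = abs(kl - (cross_entropy(P, Q) - entropy(P)))
```

Here `gap` is zero up to floating-point rounding, and `kl` vanishes when the two arguments coincide.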


13.3.1.2 Jensen—Shannon (JS) Divergence 


This corresponds to a special case of $f$-divergence with the generator

$$
f(t) = (t + 1) \log\left(\frac{2}{t + 1}\right) + t \log t.
$$

Using this, we can show that, up to a constant scaling factor, JS divergence is closely
related to the KL divergence as:

$$
D_{JS}(P \| Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M), \tag{13.17}
$$

where $M = (P + Q)/2$.

Note that JS divergence has important advantages over KL divergence. Since
$M = (P + Q)/2$, we can always guarantee $P \ll M$ and $Q \ll M$. Therefore,
the Radon-Nikodym derivatives $dP/dM$ and $dQ/dM$ are always well-defined and
the $f$-divergence in (13.11) can be obtained. On the other hand, to use the KL
divergence $D_{KL}(P \| Q)$ or $D_{KL}(Q \| P)$, we should have $P \ll Q$ or $Q \ll P$,
respectively, which is difficult to know a priori in practice.
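
The difference becomes visible when the two distributions have disjoint supports. In the following sketch (function names and example distributions are assumptions for illustration), the KL divergence is infinite because absolute continuity fails, while the JS divergence of (13.17) stays finite.

```python
import math


def kl(P, Q):
    """Discrete KL divergence; returns inf when P is not << Q."""
    total = 0.0
    for p, q in zip(P, Q):
        if p == 0:
            continue          # 0 * log(0 / q) is taken as 0
        if q == 0:
            return math.inf   # P charges a point that Q ignores
        total += p * math.log(p / q)
    return total


def js(P, Q):
    """JS divergence as in (13.17), with M = (P + Q) / 2."""
    M = [(p + q) / 2 for p, q in zip(P, Q)]
    return 0.5 * kl(P, M) + 0.5 * kl(Q, M)


# Two point masses with disjoint supports.
P = [1.0, 0.0]
Q = [0.0, 1.0]

kl_pq = kl(P, Q)   # infinite
js_pq = js(P, Q)   # finite, equal to log 2
```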

The generators for other forms of f-divergence are defined in Table 13.1. Later, 
we will show that various types of GAN architecture emerge depending on the 
choice of the generator. 


13.3.2 Wasserstein Metric


Unlike the f-divergence, the Wasserstein metric is a metric that satisfies all four 
properties of a metric in Definition 1.1. Therefore, this becomes a powerful way of 
measuring distance in the probability space. For example, to define an f-divergence, 
we should always check the absolute continuity w.r.t. each other, which is difficult 
in practice. In the Wasserstein metric, such hassles are no longer necessary. 

Let $(M, d)$ be a metric space with a metric $d$. For $p \ge 1$, let $P_p(M)$ denote the
collection of all probability measures $\mu$ on $M$ with a finite $p$-th moment. Then, the
$p$-th Wasserstein distance between two probability measures $\mu$ and $\nu$ in $P_p(M)$ is
defined as

$$
W_p(\mu, \nu) := \left( \inf_{\pi \in \Pi(\mu, \nu)} \int_{M \times M} d(x, y)^p \, d\pi(x, y) \right)^{1/p} \tag{13.18}
$$

$$
= \left( \inf_{\pi \in \Pi(\mu, \nu)} E_\pi\left[ d(X, Y)^p \right] \right)^{1/p}, \tag{13.19}
$$

where $\Pi(\mu, \nu)$ denotes the collection of all measures on $M \times M$ with marginals $\mu$
and $\nu$ on the first and second factors respectively, $X$, $Y$ are random vectors
with the joint distribution $\pi$, and $E_\pi[\cdot]$ is the expectation with respect to the joint
measure $\pi$ defined by

$$
E_\pi[f(X, Y)] = \int_{M \times M} f(x, y) \, d\pi(x, y). \tag{13.20}
$$

When $p = 1$, this is often called the "earth-mover distance" or Wasserstein-1 metric.
In the following, we provide some examples where the closed-form solution for the
Wasserstein distance in (13.18) can be obtained.


Example: 1-D Cases
Let $\mu$ and $\nu$ denote 1-D probability measures with the cumulative
distribution functions $F$ and $G$, respectively. Then, we have

$$
W_p(\mu, \nu) = \left( \int_0^1 \left| F^{-1}(z) - G^{-1}(z) \right|^p dz \right)^{1/p}. \tag{13.21}
$$

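Example: Normal Distribution
If $\mu \sim N(m_1, \Sigma_1)$ and $\nu \sim N(m_2, \Sigma_2)$ are two normal distributions, then
we have

$$
W_2^2(\mu, \nu) = \|m_1 - m_2\|^2 + B^2(\Sigma_1, \Sigma_2), \tag{13.22}
$$

where

$$
B^2(\Sigma_1, \Sigma_2) = \mathrm{Tr}(\Sigma_1) + \mathrm{Tr}(\Sigma_2) - 2 \, \mathrm{Tr}\left[ \left( \Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2} \right)^{1/2} \right], \tag{13.23}
$$

where $\mathrm{Tr}(\cdot)$ denotes the matrix trace.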
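
The two closed forms can be cross-checked in one dimension, where (13.23) reduces to $B^2 = (\sigma_1 - \sigma_2)^2$. The sketch below approximates the quantile integral (13.21) with a midpoint rule and compares it with (13.22). The helper name `wasserstein_1d`, the grid size, and the chosen means and standard deviations are assumptions for illustration.

```python
from statistics import NormalDist


def wasserstein_1d(F_inv, G_inv, p=2, n=10_000):
    """Midpoint-rule approximation of the quantile formula (13.21)."""
    total = 0.0
    for i in range(n):
        z = (i + 0.5) / n  # midpoints avoid the endpoints 0 and 1
        total += abs(F_inv(z) - G_inv(z)) ** p
    return (total / n) ** (1.0 / p)


# Two hypothetical 1-D normal distributions.
m1, s1 = 0.0, 1.0
m2, s2 = 3.0, 2.0
mu = NormalDist(m1, s1)
nu = NormalDist(m2, s2)

w2_numeric = wasserstein_1d(mu.inv_cdf, nu.inv_cdf, p=2)

# 1-D special case of (13.22)-(13.23): B^2 = (s1 - s2)^2.
w2_sq_closed = (m1 - m2) ** 2 + (s1 - s2) ** 2
```

The squared numerical value agrees with the closed form up to the quadrature error of the midpoint rule.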

In general, a direct computation of the distance in (13.18) is often difficult. The
following section shows that there exists a more manageable way of computing the
Wasserstein metric through the dual formulation. In fact, this leads to the theory of
optimal transport [182, 184].


13.4 Optimal Transport 
13.4.1 Monge’s Original Formulation 


Optimal transport provides a mathematical means to operate between two probability
measures [182, 184]. Formally, we say that $T : X \mapsto Y$ transports a probability
measure $\mu \in P(X)$ to another measure $\nu \in P(Y)$, if

$$
\nu(B) = \mu\left( T^{-1}(B) \right), \quad \text{for all } \nu\text{-measurable sets } B, \tag{13.24}
$$

which is simply the push-forward of the measure, i.e., $\nu = T_\#\mu$. See Fig. 13.4 for
an example of an optimal transport.

Suppose there is a cost function $c : X \times Y \to \mathbb{R} \cup \{\infty\}$ such that $c(x, y)$ represents
the cost of moving one unit of mass from $x \in X$ to $y \in Y$. Monge's original OT
problem [182, 184] is then to find a transport map $T$ that transports $\mu$ to $\nu$ at the
minimum total transportation cost:

$$
\min_T \; M(T) := \int_X c(x, T(x)) \, d\mu(x) \quad \text{subject to} \quad \nu = T_\#\mu. \tag{13.25}
$$

The nonlinear push-forward constraint $\nu = T_\#\mu$ is difficult to handle, and sometimes
no feasible map $T$ exists because a map cannot split indivisible mass [182, 184].

In the following, we provide some examples where the closed form solution for 
the optimal transport map can be obtained. 


Fig. 13.4 Optimal transport from a distribution (measure) $\mu$ to another measure $\nu$

Example: 1-D Cases
Using the change of variable $x = F^{-1}(z)$, the Wasserstein-$p$ metric in (13.21)
can be represented by:

$$
\begin{aligned}
W_p(\mu, \nu) &= \left( \int_0^1 \left| F^{-1}(z) - G^{-1}(z) \right|^p dz \right)^{1/p} \\
&= \left( \int_{\mathbb{R}} \left| x - G^{-1}(F(x)) \right|^p dF(x) \right)^{1/p}.
\end{aligned} \tag{13.26}
$$

Therefore, for the given transport cost $c(x, y) = |x - y|^p$, we can see that
Monge's optimal transport map is given by

$$
T(x) = G^{-1}(F(x)).
$$


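Example: Normal Distribution
If $\mu \sim N(m_1, \Sigma_1)$ and $\nu \sim N(m_2, \Sigma_2)$ are two normal distributions, then
the optimal transport map with $T_\#\mu = \nu$ is given by

$$
T : x \mapsto m_2 + A(x - m_1), \tag{13.27}
$$

where

$$
A = \Sigma_1^{-1/2} \left( \Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2} \right)^{1/2} \Sigma_1^{-1/2}. \tag{13.28}
$$

In particular, if $\Sigma_1 = \sigma_1^2 I$ and $\Sigma_2 = \sigma_2^2 I$, then the optimal transport map is
given by

$$
T : x \mapsto m_2 + \frac{\sigma_2}{\sigma_1}(x - m_1). \tag{13.29}
$$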
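
In one dimension the two examples above describe the same map, which can be confirmed numerically: composing the source cdf with the target quantile function should reproduce the affine map (13.29). The sketch below assumes two arbitrary 1-D normal distributions; `monge_map_1d` and `T_closed` are hypothetical helper names.

```python
from statistics import NormalDist


def monge_map_1d(F, G_inv):
    """Monge map T = G^{-1} o F for 1-D measures with cdfs F and G."""
    return lambda x: G_inv(F(x))


# Hypothetical source and target distributions.
mu = NormalDist(1.0, 0.5)
nu = NormalDist(-2.0, 1.5)

T = monge_map_1d(mu.cdf, nu.inv_cdf)


def T_closed(x):
    # Affine map of (13.29) with scalar variances.
    return nu.mean + (nu.stdev / mu.stdev) * (x - mu.mean)


xs = [0.0, 0.5, 1.0, 1.5, 2.0]
max_gap = max(abs(T(x) - T_closed(x)) for x in xs)
```

The quantity `max_gap` is zero up to the numerical accuracy of the cdf and its inverse.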


13.4.2 Kantorovich Formulation


Kantorovich relaxed the original OT problem by considering probabilistic transport that
allows mass splitting from a source toward several targets [182, 184]. Specifically,
Kantorovich introduced a joint measure $\pi \in P(X \times Y)$ such that the original
problem can be relaxed as

$$
\min_\pi \int_{X \times Y} c(x, y) \, d\pi(x, y) \tag{13.30}
$$

$$
\text{subject to} \quad \pi(A \times Y) = \mu(A), \quad \pi(X \times B) = \nu(B)
$$

for all measurable sets $A \subseteq X$ and $B \subseteq Y$. Here, the last two constraints come from
the observation that the total amount of mass removed from any measurable set has
to be equal to the marginal distributions [182, 184].
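
Mass splitting is easiest to see on a toy problem in which no Monge map exists. In the sketch below (an illustrative assumption, not an example from the text), a unit mass at 0 must be sent to two half masses at -1 and +1. A deterministic map cannot do this, but a coupling can, and its marginal constraints and cost are straightforward to evaluate.

```python
def marginals(plan):
    """First and second marginals of a discrete plan {(x, y): mass}."""
    first, second = {}, {}
    for (x, y), mass in plan.items():
        first[x] = first.get(x, 0.0) + mass
        second[y] = second.get(y, 0.0) + mass
    return first, second


def transport_cost(plan, c):
    """Kantorovich objective of (13.30) for a discrete plan."""
    return sum(mass * c(x, y) for (x, y), mass in plan.items())


# Source: unit mass at 0. Target: half mass at -1 and half at +1.
mu = {0: 1.0}
nu = {-1: 0.5, 1: 0.5}

# A coupling that splits the mass at x = 0 between the two targets.
plan = {(0, -1): 0.5, (0, 1): 0.5}

first, second = marginals(plan)
cost = transport_cost(plan, lambda x, y: abs(x - y))
```

Because the marginals of `plan` match `mu` and `nu`, the plan is feasible for the relaxed problem even though no transport map is.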

Another important advantage of the Kantorovich formulation is the dual formu- 
lation, as stated in the following theorem: 


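Theorem 13.2 (Kantorovich Duality Theorem) [182, Theorem 5.10, pp. 57-59]
Let $(X, \mu)$ and $(Y, \nu)$ be two probability spaces and let $c : X \times Y \to \mathbb{R}$ be a
continuous cost function, such that $|c(x, y)| \le c_X(x) + c_Y(y)$ for some $c_X \in L^1(\mu)$
and $c_Y \in L^1(\nu)$, where $L^1(\mu)$ denotes the Lebesgue space of functions that are
integrable with respect to the measure $\mu$. Then, there is a duality:

$$
\min_{\pi \in \Pi(\mu, \nu)} \int_{X \times Y} c(x, y) \, d\pi(x, y)
= \sup_{\varphi \in L^1(\mu)} \left\{ \int_X \varphi(x) \, d\mu(x) + \int_Y \varphi^c(y) \, d\nu(y) \right\} \tag{13.31}
$$

$$
= \sup_{\psi \in L^1(\nu)} \left\{ \int_X \psi^c(x) \, d\mu(x) + \int_Y \psi(y) \, d\nu(y) \right\}, \tag{13.32}
$$

where

$$
\Pi(\mu, \nu) := \{ \pi \mid \pi(A \times Y) = \mu(A), \; \pi(X \times B) = \nu(B) \}, \tag{13.33}
$$

and the above maximum is taken over the so-called Kantorovich potentials $\varphi$ and
$\psi$, whose c-transforms are defined as

$$
\varphi^c(y) := \inf_x \{ c(x, y) - \varphi(x) \}, \tag{13.34}
$$

$$
\psi^c(x) := \inf_y \{ c(x, y) - \psi(y) \}. \tag{13.35}
$$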

In the Kantorovich dual formulation, the computation of the c-transform $\varphi^c$ is
important. In the following, we show an important example.

In particular, when $c(x, y) = \|x - y\|$, we can reduce the possible candidates for
$\varphi$ to 1-Lipschitz functions so that we can simplify $\varphi^c$ to $-\varphi$ [182]. Using this, the
Wasserstein-1 distance can be represented by

$$
W_1(\mu, \nu) = \min_{\pi \in \Pi(\mu, \nu)} \int_{X \times X} \|x - y\| \, d\pi(x, y) \tag{13.36}
$$

$$
= \sup_{\varphi \in \mathrm{Lip}_1(X)} \left\{ \int_X \varphi(x) \, d\mu(x) - \int_X \varphi(y) \, d\nu(y) \right\}, \tag{13.37}
$$

where $\mathrm{Lip}_1(X) = \{ \varphi \in L^1(\mu) : |\varphi(x) - \varphi(y)| \le \|x - y\| \}$. Compared to the primal
form (13.36), which requires integration with respect to the joint measures,
the dual formulation in (13.37) requires only the marginals $\mu$ and $\nu$, which makes the
computation much more tractable. This is why the dual form is more widely used in
generative models.


13.4.3 Entropy Regularization 


Another way to address optimal transport problems in a computationally feasible 
way is by using the so-called Sinkhorn distance [183]. Rather than solving the 
dual problem, the main idea is to use entropy regularization with respect to the 
joint distribution z so that the optimal transport map can be found by solving 
a regularized primal problem. As the paper title indicates (“Sinkhorn distances: 
Lightspeed Computation of Optimal Transport’) [183], the introduction of the 
entropy regularization leads to a computationally efficient optimization problem. 

Although the original formulation is for discrete measures, here we provide
a continuous formulation of the Sinkhorn distance so that we can use notation similar
to before. More specifically, the continuous-domain entropy-regularized optimal
transport is formulated as [187]

$$
\inf_{\pi \in \Pi(\mu, \nu), \, \pi > 0} \int_{X \times Y} c(x, y) \, d\pi(x, y) + \gamma \int_{X \times Y} \pi(x, y) \left( \log \pi(x, y) - 1 \right) d(x, y), \tag{13.38}
$$

where $\Pi(\mu, \nu)$ denotes the set of joint distributions whose marginal distributions
are $\mu(x)$ and $\nu(y)$, respectively. Then, the following proposition shows that the
associated dual problem has a very interesting form.


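Proposition 13.2 The dual of the primal problem in (13.38) is given by

$$
\sup_{\varphi, \psi} \int_X \varphi(x) \, d\mu(x) + \int_Y \psi(y) \, d\nu(y) - \gamma \int_{X \times Y} \exp\left( -\frac{c(x, y) - \varphi(x) - \psi(y)}{\gamma} \right) d(x, y). \tag{13.39}
$$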
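
Proof Using the convex conjugate formulation in Chap. 1, we know that $e^x$ is the
convex conjugate of $x \log x - x$ for $x > 0$. Accordingly, we have

$$
\begin{aligned}
&\sup_{\varphi, \psi} \int_X \varphi \, d\mu + \int_Y \psi \, d\nu - \gamma \int_{X \times Y} \exp\left( -\frac{c - \varphi - \psi}{\gamma} \right) d(x, y) \\
&= \sup_{\varphi, \psi} \int_X \varphi \, d\mu + \int_Y \psi \, d\nu + \inf_{\pi > 0} \int_{X \times Y} \left\{ \pi (c - \varphi - \psi) + \gamma (\pi \log \pi - \pi) \right\} d(x, y) \\
&= \inf_{\pi > 0} \left\{ \int_{X \times Y} \left( c \pi + \gamma \pi (\log \pi - 1) \right) d(x, y)
+ \sup_{\varphi, \psi} \left[ \int_X \varphi \, d\mu - \int_{X \times Y} \varphi \, d\pi + \int_Y \psi \, d\nu - \int_{X \times Y} \psi \, d\pi \right] \right\},
\end{aligned}
$$

where the last step exchanges the supremum and the infimum. The inner supremum is
finite only when $\pi \in \Pi(\mu, \nu)$, in which case the last four terms vanish. Therefore, we
have

$$
\begin{aligned}
&\sup_{\varphi, \psi} \int_X \varphi(x) \, d\mu(x) + \int_Y \psi(y) \, d\nu(y) - \gamma \int_{X \times Y} \exp\left( -\frac{c(x, y) - \varphi(x) - \psi(y)}{\gamma} \right) d(x, y) \\
&= \inf_{\pi \in \Pi(\mu, \nu), \, \pi > 0} \int_{X \times Y} c(x, y) \, d\pi(x, y) + \gamma \int_{X \times Y} \pi(x, y) \left( \log \pi(x, y) - 1 \right) d(x, y).
\end{aligned}
$$

This concludes the proof. $\square$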


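The Sinkhorn distance formulation can then be obtained by a change of
variables for the dual problem (13.39). Specifically, for the potentials $\varphi$ and $\psi$, consider the
following change of variables, which yields positive functions $\alpha$ and $\beta$ by construction:

$$
\alpha(x) = e^{\varphi(x)/\gamma}, \quad \beta(y) = e^{\psi(y)/\gamma}, \tag{13.40}
$$

which leads to

$$
\sup_{\alpha, \beta} \; \gamma \int_X \log \alpha(x) \, d\mu(x) + \gamma \int_Y \log \beta(y) \, d\nu(y) - \gamma \int_{X \times Y} \alpha(x) \exp\left( -\frac{c(x, y)}{\gamma} \right) \beta(y) \, d(x, y). \tag{13.41}
$$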


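Using variational calculus, for a given perturbation $\alpha \to \alpha + \epsilon \, \delta\alpha$, the first-order
variation (after dividing by the common factor $\gamma$) is given by

$$
\int_X \frac{\delta\alpha(x)}{\alpha(x)} \, d\mu(x) - \int_X \delta\alpha(x) \int_Y \exp\left( -\frac{c(x, y)}{\gamma} \right) \beta(y) \, dy \, dx \tag{13.42}
$$

$$
= \int_X \delta\alpha(x) \left( \frac{1}{\alpha(x)} \frac{d\mu}{dx}(x) - \int_Y \exp\left( -\frac{c(x, y)}{\gamma} \right) \beta(y) \, dy \right) dx = 0. \tag{13.43}
$$

Thus, we have

$$
\alpha(x) = \frac{\frac{d\mu}{dx}(x)}{\int_Y \exp\left( -\frac{c(x, y)}{\gamma} \right) \beta(y) \, dy}. \tag{13.44}
$$

Similarly, we have

$$
\beta(y) = \frac{\frac{d\nu}{dy}(y)}{\int_X \exp\left( -\frac{c(x, y)}{\gamma} \right) \alpha(x) \, dx}. \tag{13.45}
$$

In fact, the update rules (13.44) and (13.45) are the main steps of Sinkhorn's
fixed-point iteration [183].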
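
For discrete measures the integrals in (13.44) and (13.45) become finite sums, and the alternating updates can be written in a few lines. The sketch below is a plain-Python illustration under that assumption; the function name `sinkhorn`, the fixed iteration count, and the toy cost matrix are choices made here, not part of the original algorithm description.

```python
import math


def sinkhorn(mu, nu, C, gamma, n_iter=200):
    """Discrete analogue of the updates (13.44)-(13.45).

    mu, nu : lists of marginal masses
    C      : cost matrix, C[i][j] = c(x_i, y_j)
    gamma  : entropy regularization weight
    Returns the plan pi[i][j] = alpha[i] * K[i][j] * beta[j].
    """
    K = [[math.exp(-c / gamma) for c in row] for row in C]
    n, m = len(mu), len(nu)
    alpha = [1.0] * n
    beta = [1.0] * m
    for _ in range(n_iter):
        # (13.44): alpha = mu / (K beta)
        alpha = [mu[i] / sum(K[i][j] * beta[j] for j in range(m))
                 for i in range(n)]
        # (13.45): beta = nu / (K^T alpha)
        beta = [nu[j] / sum(K[i][j] * alpha[i] for i in range(n))
                for j in range(m)]
    return [[alpha[i] * K[i][j] * beta[j] for j in range(m)]
            for i in range(n)]


# Hypothetical two-point example with a 0/1 cost.
mu = [0.3, 0.7]
nu = [0.6, 0.4]
C = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(mu, nu, C, gamma=0.5)

row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(2)) for j in range(2)]
```

After convergence the row and column sums of the plan reproduce the prescribed marginals, and every entry is strictly positive because of the entropy term.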


13.5 Generative Adversarial Networks 


With the mathematical backgrounds set, we are now ready to discuss specific forms 
of the generative models, and explain how they can be derived in a unified theoretical 
framework. In this section, we will mainly describe the decoder-type generative 
models, which we simply call generative models. Later, we will explain how this 
analysis can be extended to the autoencoder-type generative models. 


13.5.1 Earliest Form of GAN 


The original form of the generative adversarial network (GAN) [88] was inspired by the
success of discriminative models for classification. In particular, Goodfellow et al.
[88] formulated generative model training as a minimax game between a generative
network (generator) that maps a random latent vector into the data in the ambient
space, and a discriminative network that tries to distinguish the generated samples from
real samples. Surprisingly, this minimax formulation of a deep generative model can
transfer the success of deep discriminative models to generative models, resulting in
significant improvement in generative model performance [88]. In fact, the success
of GANs has generated significant interest in generative models in general, which
has been followed by many breakthrough ideas.

Before we explain the geometric structure of the GAN and its variants from a
unified framework, we briefly present the original explanation of the GAN, since it
is more intuitive to the general public. Let $\mathcal{X}$ and $\mathcal{Z}$ denote the ambient and latent
spaces equipped with the measures $\mu$ and $\zeta$, respectively (recall the geometric picture
in Fig. 13.2). Then, the original form of the GAN solves the following minimax
game:

$$
\min_G \max_D \; \ell_{GAN}(D, G), \tag{13.46}
$$

where

$$
\ell_{GAN}(D, G) := E_\mu[\log D(x)] + E_\zeta[\log(1 - D(G(z)))],
$$

where $D(x)$ is the discriminator that takes a sample as input and outputs a scalar
in $[0, 1]$, $G(z)$ is the generator that maps a latent vector $z$ to a vector in the ambient
space, and

$$
E_\mu[\log D(x)] = \int_{\mathcal{X}} \log D(x) \, d\mu(x),
$$

$$
E_\zeta[\log(1 - D(G(z)))] = \int_{\mathcal{Z}} \log(1 - D(G(z))) \, d\zeta(z).
$$

The meaning of (13.46) is that the generator tries to fool the discriminator, while
the discriminator wants to maximize its power to differentiate between the true
and generated samples. In GANs, the discriminator and generator are usually
implemented as deep networks which are parameterized by network parameters $\phi$
and $\theta$, i.e. $D(x) := D_\phi(x)$, $G(z) := G_\theta(z)$. Therefore, (13.46) can be formulated
as a minmax problem with respect to $\theta$ and $\phi$.
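
A small discrete experiment helps to see what the inner maximization does. The sketch below fixes the generator, so that the generated distribution is a known pmf `q` on a three-point ambient space, and evaluates the objective for the discriminator $D^*(x) = p(x)/(p(x) + q(x))$. The optimality of this $D^*$ and the resulting value $2 D_{JS} - \log 4$ are the well-known results of the original GAN analysis [88], quoted here rather than derived; all names and numbers are illustrative assumptions.

```python
import math


def gan_value(D, p, q):
    """Objective of (13.46) on a finite space for a fixed generator.

    p : pmf of the real data, q : pmf of the generated data,
    D : discriminator outputs in (0, 1), one per point.
    """
    return sum(pi * math.log(d) + qi * math.log(1.0 - d)
               for d, pi, qi in zip(D, p, q))


def kl(P, Q):
    return sum(a * math.log(a / b) for a, b in zip(P, Q) if a > 0)


# Hypothetical real and generated pmfs.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

# Discriminator D*(x) = p(x) / (p(x) + q(x)) from the analysis in [88].
D_star = [pi / (pi + qi) for pi, qi in zip(p, q)]
best = gan_value(D_star, p, q)

# Value predicted by the same analysis: 2 * JS(p, q) - log 4.
M = [(pi + qi) / 2 for pi, qi in zip(p, q)]
js = 0.5 * kl(p, M) + 0.5 * kl(q, M)
predicted = 2.0 * js - math.log(4.0)

# An uninformative discriminator does worse.
blind = gan_value([0.5, 0.5, 0.5], p, q)
```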

Figure 13.5 illustrates some of the samples generated by GANs from this minmax
optimization that appeared in the original paper [88]. By current standards, the
results look very poor, but when these were published in 2014, they shocked
the world and were considered state of the art. We can again see the light-speed
progress of generative model technology.

Since the GAN was first published, one of the puzzling questions has been the
mathematical origin of the minmax problem, and why it is important. In fact, the
pursuit of answers to such questions has been very rewarding, and has led to the
discovery of numerous key results that are essential to understanding the
geometric structure of GANs.

Among them, the two most notable results are the f-GAN [176] and the Wasserstein
GAN (W-GAN) [177], which will be reviewed in the following sections. These
works reveal that the GAN indeed originates from minimizing statistical distances
using a dual formulation. These two methods differ only in their choices of statistical
distances and associated dual formulations.

Fig. 13.5 Examples of GAN-generated samples in [88]. The rightmost columns show the nearest
training example of the neighboring sample, in order to demonstrate that the model has not
memorized the training set. These images show actual samples from the model distributions,
not conditional means given samples of hidden units. (a) TFD, (b) MNIST, (c) CIFAR-10 (fully
connected model), (d) CIFAR-10


13.5.2 f-GAN 


The f-GAN [176] was perhaps one of the most important theoretical results in the 
early history of GANs, and clearly demonstrates the importance of the statistical 
distances and dual formulation. As the name suggests, the f-GAN starts with f- 
divergence. 

Recall that the $f$-divergence is defined by

$$
D_f(\mu \| \nu) = \int_\Omega f\left( \frac{d\mu}{d\nu} \right) d\nu \tag{13.47}
$$

if $\mu \ll \nu$. The main idea of the f-GAN (which includes the original GAN) is to
use the $f$-divergence as a statistical distance between the real data distribution on $\mathcal{X}$ with
the measure $\mu$ and the synthesized data distribution in the ambient space $\mathcal{X}$ with the
measure $\nu := \mu_\theta$, so that the probability measure $\nu$ gets closer to $\mu$ (see Fig. 13.2,
where $\mu_\theta$ is now denoted by $\nu$ for notational simplicity). The key observation
is that instead of directly minimizing the $f$-divergence, something very interesting
emerges if we formulate its dual problem. More specifically, the authors exploit the
following dual formulation of the $f$-divergence [176], whose proof is repeated here
for educational purposes. Recall the following definition of a convex conjugate (for
more detail, see Chap. 1):


Definition 13.4 ([6]) For a given function $f : I \mapsto \mathbb{R}$, its convex conjugate is
defined by

$$
f^*(u) = \sup_{\tau \in I} \{ u\tau - f(\tau) \}. \tag{13.48}
$$

If $f$ is a convex function, the convex conjugate of its convex conjugate is the
function itself, i.e.

$$
f(u) = f^{**}(u) = \sup_{\tau \in I^*} \{ u\tau - f^*(\tau) \}, \tag{13.49}
$$

if $f^* : I^* \mapsto \mathbb{R}$. This is the property we need in the following lemma.


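Lemma 13.1 ([176]) Let $\mu \ll \nu$. Then, for any class of functions $\tau$ mapping from
$\mathcal{X}$ to $\mathbb{R}$, we have the lower bound

$$
D_f(\mu \| \nu) \ge \sup_{\tau \in I^*} \int_{\mathcal{X}} \tau(x) \, d\mu(x) - \int_{\mathcal{X}} f^*(\tau(x)) \, d\nu(x), \tag{13.50}
$$

where $f^* : I^* \mapsto \mathbb{R}$ is the convex conjugate of $f$.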


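Proof The proof is a simple consequence of the convex conjugate. More specifically,
we have

$$
\begin{aligned}
D_f(\mu \| \nu) &= \int_{\mathcal{X}} f\left( \frac{d\mu}{d\nu} \right) d\nu \\
&= \int_{\mathcal{X}} \sup_{\tau \in I^*} \left\{ \tau \frac{d\mu}{d\nu} - f^*(\tau) \right\} d\nu \\
&\ge \sup_{\tau \in I^*} \int_{\mathcal{X}} \left\{ \tau \frac{d\mu}{d\nu} - f^*(\tau) \right\} d\nu \\
&= \sup_{\tau \in I^*} \int_{\mathcal{X}} \tau \, d\mu - f^*(\tau) \, d\nu \\
&= \sup_{\tau \in I^*} \int_{\mathcal{X}} \tau(x) \, d\mu(x) - \int_{\mathcal{X}} f^*(\tau(x)) \, d\nu(x).
\end{aligned}
$$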

The lower bound in (13.50) is tight and can be achieved at

$$
\tau^*(x) = f'\left( \frac{d\mu}{d\nu} \right) = f'\left( \frac{p(x)}{q(x)} \right), \tag{13.51}
$$

where the last equality holds when $d\mu = p \, d\xi$ and $d\nu = q \, d\xi$ for a common measure
$\xi$ [176].

While the lower bound in (13.50) is intuitive, one of the complications in the
derivation of the f-GAN is that the function $\tau$ should be within the domain of $f^*$,
i.e. $\tau \in I^*$. To address this, the authors in [176] proposed the following trick:

$$
\tau(x) = g_f(V(x)), \tag{13.52}
$$

where $V : \mathcal{X} \mapsto \mathbb{R}$ without any constraint on the output range, and $g_f : \mathbb{R} \mapsto I^*$
is an output activation function that maps the output to the domain of $f^*$. Then, the
f-GAN can be formulated as follows:

$$
\min_G \max_{g_f} \; \ell_{fGAN}(G, g_f), \tag{13.53}
$$

where

$$
\ell_{fGAN}(G, g_f) := E_\mu\left[ g_f(V(x)) \right] - E_\zeta\left[ f^*\left( g_f(V(G(z))) \right) \right].
$$
For example, if we choose

$$
f(t) = -(t + 1) \log(t + 1) + t \log t,
$$

then its convex conjugate is given by

$$
\begin{aligned}
f^*(u) &= \sup_{t \in \mathbb{R}_+} \{ ut + (t + 1) \log(t + 1) - t \log t \} \\
&= -\log(1 - e^u).
\end{aligned}
$$

The domain of the conjugate function $f^*$ should be $\mathbb{R}_-$ in order to ensure $1 - e^u > 0$.
One of the functions $g_f$ that allows this is given by

$$
g_f(V) = \log\left( \frac{1}{1 + e^{-V}} \right) = \log \mathrm{Sig}(V),
$$

where $\mathrm{Sig}(\cdot)$ is the sigmoid function. Accordingly, we have

$$
f^*(g_f(V)) = -\log\left( 1 - e^{\log \mathrm{Sig}(V)} \right) = -\log(1 - \mathrm{Sig}(V)).
$$
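
The conjugate pair used above can be sanity-checked numerically. The sketch below approximates the supremum in the definition of $f^*$ by a grid search and compares it with the closed form $-\log(1 - e^u)$, then verifies the composition with the log-sigmoid activation. The grid range and resolution, as well as all function names, are assumptions chosen for illustration.

```python
import math


def f_star_closed(u):
    """Closed-form conjugate, valid for u < 0."""
    return -math.log(1.0 - math.exp(u))


def f_star_grid(u, t_max=50.0, steps=50_000):
    """Grid-search approximation of sup_t { u t - f(t) } over t > 0,
    with f(t) = -(t + 1) log(t + 1) + t log t."""
    best = -math.inf
    for k in range(1, steps + 1):
        t = t_max * k / steps
        value = u * t + (t + 1.0) * math.log(t + 1.0) - t * math.log(t)
        best = max(best, value)
    return best


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def g_f(v):
    # Output activation mapping R into the domain of f*, i.e. (-inf, 0).
    return math.log(sigmoid(v))


conjugate_gap = max(abs(f_star_grid(u) - f_star_closed(u))
                    for u in (-2.0, -1.0, -0.5))
activation_gap = max(abs(f_star_closed(g_f(v)) + math.log(1.0 - sigmoid(v)))
                     for v in (-3.0, 0.0, 2.0))
```

Both gaps are zero up to grid resolution and floating-point rounding, which matches the algebra in the text.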

Therefore, if we use a discriminator with the sigmoid as the last layer, we have
$D(x) = \mathrm{Sig}(V(x))$, and this leads to the following f-GAN cost function:

$$
\begin{aligned}
&\sup_{\tau \in I^*} \int_{\mathcal{X}} \tau(x) \, d\mu(x) - \int_{\mathcal{X}} f^*(\tau(x)) \, d\nu(x) \\
&= \sup_{g_f, V} \int_{\mathcal{X}} g_f(V(x)) \, d\mu(x) - \int_{\mathcal{X}} f^*(g_f(V(x))) \, d\nu(x) \\
&= \sup_D \int_{\mathcal{X}} \log D(x) \, d\mu(x) + \int_{\mathcal{X}} \log(1 - D(x)) \, d\nu(x).
\end{aligned}
$$

Finally, the measure $\nu$ describes the samples generated from the latent space $\mathcal{Z}$ with the measure $\zeta$
by the generator $G(z)$, $z \in \mathcal{Z}$, so $\nu$ is the push-forward measure $G_\#\zeta$ (see Fig. 13.2).
Using the change-of-variable formula in Proposition 13.1, the final loss function is
given by

$$
\ell(D, G) := \sup_D \int_{\mathcal{X}} \log D(x) \, d\mu(x) + \int_{\mathcal{Z}} \log(1 - D(G(z))) \, d\zeta(z).
$$

This is equivalent to the original GAN cost function.
By changing the generator function $f$ of the divergence, we can now obtain various types of GAN variants.
Table 13.1 summarizes various forms of the f-GAN.


13.5.3 Wasserstein GAN (W-GAN) 


Note that the f-GAN interprets GAN training as a statistical distance minimization
in the form of a dual formulation. However, its main limitation is that the
$f$-divergence is not a metric, which may limit the achievable performance.

A similar minimization idea is employed for the Wasserstein GAN, but now with
a real metric in probability space. More specifically, the W-GAN minimizes the
following Wasserstein-1 distance:

$$
W_1(\mu, \nu) := \min_{\pi \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{X}} \|x - x'\| \, d\pi(x, x'), \tag{13.54}
$$

where $\mathcal{X}$ is the ambient space, $\mu$ and $\nu$ are measures for the real data and generated
data, respectively, and $\pi(x, x')$ is the joint distribution with the marginals $\mu$ and $\nu$,
respectively (recall the definition of $\Pi(\mu, \nu)$ in (13.33)).

Similar to the f-GAN, rather than solving the complicated primal problem, a
dual problem is solved. Recall that the Kantorovich dual formulation leads to the
following dual form of the Wasserstein-1 distance:

$$
W_1(\mu, \nu) = \sup_{\varphi \in \mathrm{Lip}_1(\mathcal{X})} \left\{ \int_{\mathcal{X}} \varphi(x) \, d\mu(x) - \int_{\mathcal{X}} \varphi(x') \, d\nu(x') \right\}, \tag{13.55}
$$

where $\mathrm{Lip}_1(\mathcal{X})$ denotes the 1-Lipschitz function space with domain $\mathcal{X}$. Again, the
measure $\nu$ describes the samples generated from the latent space $\mathcal{Z}$ with the measure $\zeta$
by the generator $G(z)$, $z \in \mathcal{Z}$, so $\nu$ can be considered as the push-forward measure
$\nu = G_\#\zeta$. Using the change-of-variable formula in Proposition 13.1, the final loss
function is given by

$$
W_1(\mu, \nu) = \sup_{\varphi \in \mathrm{Lip}_1(\mathcal{X})} \left\{ \int_{\mathcal{X}} \varphi(x) \, d\mu(x) - \int_{\mathcal{Z}} \varphi(G(z)) \, d\zeta(z) \right\}. \tag{13.56}
$$

Therefore, the Wasserstein-1 distance minimization problem can be equivalently
represented by the following minmax formulation:

$$
\min_\nu W_1(\mu, \nu) = \min_G \max_{\varphi \in \mathrm{Lip}_1(\mathcal{X})} \left\{ \int_{\mathcal{X}} \varphi(x) \, d\mu(x) - \int_{\mathcal{Z}} \varphi(G(z)) \, d\zeta(z) \right\},
$$

where $G(z)$ is called the generator, and the Kantorovich potential $\varphi$ is called the
discriminator.

Therefore, imposing a 1-Lipschitz condition on the discriminator is necessary in
the W-GAN [177]. There are many approaches to address this. For example, in the
original W-GAN paper [177], weight clipping was used to impose a 1-Lipschitz
condition. Another method is to use spectral normalization [188], which utilizes the
power iteration method to impose a constraint on the largest singular value of the
weight matrix in each layer. Yet another popular method is the W-GAN with the
gradient penalty (WGAN-GP), where the norm of the gradient of the Kantorovich potential is
encouraged to be 1 [189]. Specifically, the following modified loss function is used
for the minmax problem:

$$
\begin{aligned}
\ell_{W\text{-}GAN}(G; \varphi) = {} & \left( \int_{\mathcal{X}} \varphi(x) \, d\mu(x) - \int_{\mathcal{Z}} \varphi(G(z)) \, d\zeta(z) \right) \\
& - \eta \int_{\mathcal{X}} \left( \|\nabla_{\tilde{x}} \varphi(\tilde{x})\|_2 - 1 \right)^2 d\mu(x),
\end{aligned} \tag{13.57}
$$

where $\eta > 0$ is the regularization parameter to impose a 1-Lipschitz property on the
discriminator, and $\tilde{x} = \alpha x + (1 - \alpha) G(z)$ with $\alpha$ being a random variable drawn from the
uniform distribution on $[0, 1]$ [189].
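
The structure of (13.57) can be illustrated with a scalar toy problem. The sketch below is not a training loop: it only evaluates a sample-based estimate of the objective for a given critic. It assumes 1-D data and replaces automatic differentiation, which a deep learning framework would normally provide, with a central finite difference. The function name `wgan_gp_objective` and all numbers are hypothetical.

```python
import random


def wgan_gp_objective(critic, real, fake, eta, rng, h=1e-5):
    """Sample estimate of (13.57) for a scalar critic on 1-D data.

    critic : function R -> R (the Kantorovich potential)
    real   : samples from mu;  fake : generator outputs G(z)
    eta    : gradient penalty weight
    The gradient is approximated by a central finite difference,
    a stand-in for autograd used only to keep the sketch dependency-free.
    """
    n = len(real)
    wasserstein_term = (sum(critic(x) for x in real) / n
                        - sum(critic(y) for y in fake) / n)
    penalty = 0.0
    for x, y in zip(real, fake):
        a = rng.random()                   # alpha ~ Uniform[0, 1]
        x_tilde = a * x + (1.0 - a) * y    # random interpolate
        grad = (critic(x_tilde + h) - critic(x_tilde - h)) / (2.0 * h)
        penalty += (abs(grad) - 1.0) ** 2
    return wasserstein_term - eta * penalty / n


rng = random.Random(0)
real = [1.0, 2.0, 3.0]
fake = [0.0, 0.0, 0.0]

# A critic with unit slope is not penalized ...
obj_lipschitz = wgan_gp_objective(lambda x: x, real, fake, 10.0, rng)
# ... while a steeper critic pays (3 - 1)^2 per sample, scaled by eta.
obj_steep = wgan_gp_objective(lambda x: 3.0 * x, real, fake, 10.0, rng)
```

The comparison shows why the penalty matters: without it, the steeper critic would simply triple the first term and appear better.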


13.5.4 StyleGAN 


As mentioned before, one of the most exciting developments at CVPR 2019 was
the introduction of a novel generative adversarial network (GAN) called StyleGAN
by Nvidia [89], which can produce very realistic high-resolution images.

Aside from various sophisticated tricks, StyleGAN also introduced important
innovations from a theoretical perspective. For example, one of the main
breakthroughs of StyleGAN comes from AdaIN. The neural network in Fig. 13.6
generates the latent codes that are used as style image feature vectors. Then,
the AdaIN layer combines the style features and the content features together to
generate more realistic features at each resolution.

Fig. 13.6 Architecture of StyleGAN

Yet another breakthrough idea is that StyleGAN introduces noise into each layer
to create stochastic variation, as shown in Fig. 13.6. Recall that most GANs
start with a simple latent vector $z$ in the latent space as an input to the generator.
On the other hand, the noise at each layer of StyleGAN can be considered as a more
complicated latent space, so that a mapping from a more complicated input latent
space to the data domain produces more realistic images. In fact, by introducing a
more complicated latent space, StyleGAN enables local changes at the pixel level
and targets stochastic variation in generating local variants of features.


13.6 Autoencoder-Type Generative Models 


Although we have already discussed generative models such as the GAN,
historically the autoencoder-type generative model precedes the GAN-type models. In
fact, the autoencoder-type generative model goes back to the denoising autoencoder
[190], which is a deterministic form of encoder-decoder network.

The real generative autoencoder model in fact originates from the variational
autoencoder (VAE) [174], which enables the generation of target samples by
changing latent variables using random samples. Another breakthrough in the VAE
comes from the normalizing flow [178-181], which significantly improves the
quality of generated samples by allowing invertible mapping. In this section, we
review the two ideas in a unified geometric framework. To do this, we first explain
an important concept in variational inference: the evidence lower bound (ELBO),
also called the variational lower bound [191].


13.6.1 ELBO 


In variational inference such as VAE, our model distribution $p_\theta(x)$ is obtained
by combining a simple distribution $p(z)$ with a family of conditional distributions
$p_\theta(x|z)$, so that our objective is written as

$$
\begin{aligned}
\log p_\theta(x) &= \log \left( \int p_\theta(x, z) \, dz \right) \\
&= \log \left( \int p_\theta(x|z) p(z) \, dz \right).
\end{aligned} \tag{13.58}
$$

Here, the goal is to find the parameter $\theta$ to maximize the log-likelihood using the
given data set $x \in \mathcal{X}$.
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Although p(z) and pg(x|z) will generally be simple by choice, it may be 
impossible to compute log pg (x) analytically due to the need to solve the integral 
inside the logarithm. A trick to address this problem is to introduce a distribution 
qo(z|x) parameterized by ¢ and conditioned on x such that 


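$$
\log p_\theta(x) = \log\left(\int p_\theta(x|z)\frac{p(z)}{q_\phi(z|x)}\,q_\phi(z|x)\,dz\right)
\geq \int \log\left(p_\theta(x|z)\frac{p(z)}{q_\phi(z|x)}\right) q_\phi(z|x)\,dz,
$$

where we use Jensen’s inequality [192]. Accordingly, we have

$$
\begin{aligned}
\log p_\theta(x) &\geq \int \log p_\theta(x|z)\,q_\phi(z|x)\,dz - \int \log\left(\frac{q_\phi(z|x)}{p(z)}\right) q_\phi(z|x)\,dz \\
&= \int \log p_\theta(x|z)\,q_\phi(z|x)\,dz - D_{KL}\left(q_\phi(z|x)\,\|\,p(z)\right),
\end{aligned}
$$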


which is often called the evidence lower bound (ELBO) or the variational lower 
bound [191]. 

Since the choice of the posterior $q_\phi(z|x)$ is arbitrary, the goal of variational
inference is to find $q_\phi$ that maximizes the ELBO or, equivalently, minimizes the
following loss function:

$$
\ell_{ELBO}(x; \theta, \phi) = -\int \log p_\theta(x|z)\,q_\phi(z|x)\,dz + D_{KL}\left(q_\phi(z|x)\,\|\,p(z)\right), \tag{13.59}
$$

where the first term is the likelihood term and the second KL term can be interpreted
as a penalty term. Variational inference then tries to find $\theta$ and $\phi$ that minimize the
loss for a given $x$, or the average loss over all $x$.
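
To make the bound concrete, the following minimal sketch evaluates the ELBO in closed form for a toy scalar model that is not part of the text: we assume $p(z) = \mathcal{N}(0, 1)$ and $p_\theta(x|z) = \mathcal{N}(z, 1)$, so that $p_\theta(x) = \mathcal{N}(0, 2)$, and we take a Gaussian $q_\phi(z|x) = \mathcal{N}(m, s^2)$. The function names and the test value of $x$ are illustrative choices only.

```python
import math

def log_evidence(x):
    """log p(x) for the toy model p(z)=N(0,1), p(x|z)=N(z,1), hence p(x)=N(0,2)."""
    return -0.5 * math.log(4.0 * math.pi) - x * x / 4.0

def elbo(x, m, s2):
    """Closed-form ELBO for q(z|x)=N(m, s2) under the same toy model."""
    # Likelihood term: E_q[log p(x|z)] with p(x|z) = N(x; z, 1)
    expected_loglik = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    # Penalty term: KL( N(m, s2) || N(0, 1) )
    kl = 0.5 * (s2 + m * m - 1.0 - math.log(s2))
    return expected_loglik - kl

x = 1.3
# Gap between log-evidence and ELBO for an arbitrary q and for the true posterior.
gap_arbitrary = log_evidence(x) - elbo(x, m=0.0, s2=1.0)
gap_posterior = log_evidence(x) - elbo(x, m=x / 2.0, s2=0.5)
print(gap_arbitrary, gap_posterior)
```

In this toy model the gap is strictly positive for the arbitrary choice of $q_\phi$ and vanishes (up to rounding) when $q_\phi$ equals the true posterior $\mathcal{N}(x/2, 1/2)$, which is the case in which Jensen's inequality holds with equality.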


13.6.2. Variational Autoencoder (VAE) 


Using the ELBO, we are now ready to derive the VAE. However, our derivation is 
somewhat different from the original derivation of the VAE [174], since the original 
derivation makes it difficult to show the link to normalizing flow [178-181]. The 
following derivation originates from the f-VAE [193]. 

Specifically, among the various choices of $q_\phi(z|x)$ for the ELBO, we choose the
following form:

$$
q_\phi(z|x) = \int \delta\left(z - F_\phi^x(u)\right) r(u)\,du, \tag{13.60}
$$

where $r(u)$ is the standard Gaussian, and $F_\phi^x(u)$ is the encoder function for a given
$x$, which has another noisy input $u$. See Fig. 13.7a for the concept of the encoder
$F_\phi^x(u)$. For the given encoder function, we have the following key result for the
ELBO loss.

Fig. 13.7 Variational autoencoder architecture: (a) general form, (b) original VAE, and (c) invertible flow


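Proposition 13.3 For the given encoder in (13.60), the ELBO loss in (13.59) can
be represented by

$$
\begin{aligned}
\ell_{ELBO}(x; \theta, \phi) = &-\int \log p_\theta\left(x|F_\phi^x(u)\right) r(u)\,du
+ \int \log\left(\frac{r(u)}{p\left(F_\phi^x(u)\right)}\right) r(u)\,du \\
&- \int \log\left|\det\left(\frac{\partial F_\phi^x(u)}{\partial u}\right)\right| r(u)\,du.
\end{aligned} \tag{13.61}
$$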


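Proof Let us start with the ELBO:

$$
\ell_{ELBO}(x; \theta, \phi) = -\int \log\left(p_\theta(x|z)\frac{p(z)}{q_\phi(z|x)}\right) q_\phi(z|x)\,dz,
$$

which can be represented by

$$
\ell_{ELBO}(x; \theta, \phi) = -\int \left(\log\left(p_\theta(x|z)\,p(z)\right) - \log q_\phi(z|x)\right) q_\phi(z|x)\,dz. \tag{13.62}
$$

Using the encoder representation in (13.60), the first term of (13.62) becomes

$$
\begin{aligned}
\int\!\!\int \log\left(p_\theta(x|z)\,p(z)\right) \delta\left(z - F_\phi^x(u)\right) r(u)\,du\,dz
&= \int \log\left(p_\theta\left(x|F_\phi^x(u)\right) p\left(F_\phi^x(u)\right)\right) r(u)\,du \\
&= \int \log p_\theta\left(x|F_\phi^x(u)\right) r(u)\,du + \int \log p\left(F_\phi^x(u)\right) r(u)\,du.
\end{aligned}
$$

Similarly, the second term of (13.62) becomes

$$
\begin{aligned}
&\int\!\!\int \log\left(\int \delta\left(z - F_\phi^x(u')\right) r(u')\,du'\right) \delta\left(z - F_\phi^x(u)\right) r(u)\,du\,dz \\
&\qquad = \int \log\left[\int \delta\left(F_\phi^x(u) - F_\phi^x(u')\right) r(u')\,du'\right] r(u)\,du.
\end{aligned}
$$

Now, using the following change of variables:

$$
v = F_\phi^x(u'), \qquad u' = H_x(v),
$$

the corresponding Jacobian determinant is given by

$$
\left|\det\left(\frac{du'}{dv}\right)\right| = \frac{1}{\left|\det\left(\frac{dv}{du'}\right)\right|}
= \frac{1}{\left|\det\left(\frac{\partial F_\phi^x(u')}{\partial u'}\right)\right|}.
$$

Then, we have

$$
\begin{aligned}
\int \log&\left[\int \delta\left(F_\phi^x(u) - F_\phi^x(u')\right) r(u')\,du'\right] r(u)\,du \\
&= \int \log\left[\int \delta\left(F_\phi^x(u) - v\right) \frac{r(H_x(v))}{\left|\det\left(\frac{\partial F_\phi^x(u')}{\partial u'}\right)\right|}\,dv\right] r(u)\,du \\
&= \int \log\left[\left.\frac{r\left(H_x(F_\phi^x(u))\right)}{\left|\det\left(\frac{\partial F_\phi^x(u')}{\partial u'}\right)\right|}\right|_{v = F_\phi^x(u)}\right] r(u)\,du \\
&= \int \log r(u)\,r(u)\,du - \int \log\left|\det\left(\frac{\partial F_\phi^x(u)}{\partial u}\right)\right| r(u)\,du.
\end{aligned}
$$

By collecting terms together, we have

$$
\begin{aligned}
\ell_{ELBO}(x; \theta, \phi) = &-\int \log p_\theta\left(x|F_\phi^x(u)\right) r(u)\,du
+ \int \log\left(\frac{r(u)}{p\left(F_\phi^x(u)\right)}\right) r(u)\,du \\
&- \int \log\left|\det\left(\frac{\partial F_\phi^x(u)}{\partial u}\right)\right| r(u)\,du.
\end{aligned}
$$

This concludes the proof. □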


Proposition 13.3 is a universal result that can be applied to the VAE, normalizing
flow, etc. The major differences between them come from the choice of the encoder
$F_\phi^x(u)$. In particular, for the case of the VAE [174], the following form of the encoder
function $F_\phi^x(u)$ is used:

$$
z = F_\phi^x(u) = \mu_\phi(x) + \sigma_\phi(x) \odot u, \quad u \sim \mathcal{N}(0, I_d), \tag{13.63}
$$

where $I_d$ is the $d \times d$ identity matrix and $d$ is the dimension of the latent space.
This was referred to as the reparameterization trick in the original VAE paper [174].


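Under this choice, and with the standard Gaussian for $p(z)$, the second term in
(13.61) becomes

$$
\begin{aligned}
\int \log\left(\frac{r(u)}{p\left(F_\phi^x(u)\right)}\right) r(u)\,du
&= -\frac{1}{2}\int \|u\|^2 r(u)\,du + \frac{1}{2}\int \|\mu_\phi(x) + \sigma_\phi(x) \odot u\|^2 r(u)\,du \\
&= \frac{1}{2}\sum_{i=1}^{d} \left(\sigma_i^2(x) + \mu_i^2(x) - 1\right),
\end{aligned} \tag{13.64}
$$

whereas the third term becomes

$$
-\int \log\left|\det\left(\frac{\partial F_\phi^x(u)}{\partial u}\right)\right| r(u)\,du = -\frac{1}{2}\sum_{i=1}^{d} \log \sigma_i^2(x). \tag{13.65}
$$

Finally, the first term in (13.61) is the likelihood term, which can be represented as
follows by assuming the Gaussian distribution:

$$
\begin{aligned}
-\int \log p_\theta\left(x|F_\phi^x(u)\right) r(u)\,du
&= \frac{1}{2}\int \|x - G_\theta\left(F_\phi^x(u)\right)\|^2 r(u)\,du \\
&= \frac{1}{2}\int \|x - G_\theta\left(\mu_\phi(x) + \sigma_\phi(x) \odot u\right)\|^2 r(u)\,du.
\end{aligned} \tag{13.66}
$$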


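Therefore, the encoder and decoder parameter optimization problem for the VAE
can be obtained as follows:

$$
\min_{\theta, \phi} \ell_{VAE}(\theta, \phi),
$$

where

$$
\begin{aligned}
\ell_{VAE}(\theta, \phi) = &\ \frac{1}{2}\int_X\!\int \|x - G_\theta\left(\mu_\phi(x) + \sigma_\phi(x) \odot u\right)\|^2 r(u)\,du\,d\mu(x) \\
&+ \frac{1}{2}\sum_{i=1}^{d} \int_X \left(\sigma_i^2(x) + \mu_i^2(x) - \log \sigma_i^2(x) - 1\right) d\mu(x).
\end{aligned} \tag{13.67}
$$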


Once the neural network is trained, one very important advantage of the
VAE is that we can control the decoder output simply by changing the random
samples. More specifically, the decoder output is now given by

$$
\hat{x}(u) = G_\theta\left(\mu_\phi(x) + \sigma_\phi(x) \odot u\right), \tag{13.68}
$$

which has an explicit dependency on the random variable $u$. Therefore, for a given
$x$, we can change the output by drawing a new sample $u$.
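
The reparameterization in (13.63) and the per-sample integrand of the VAE loss in (13.67) can be sketched numerically as follows. This is only an illustration under stated assumptions: the encoder outputs `mu` and `sigma` are given as fixed vectors rather than produced by a network, the decoder is a hypothetical linear map $G_\theta(z) = Wz$, and the integral over $u$ is replaced by a Monte Carlo average.

```python
import numpy as np

def reparameterize(mu, sigma, u):
    """Encoder output z = mu + sigma * u (element-wise), as in (13.63)."""
    return mu + sigma * u

def kl_penalty(mu, sigma):
    """Closed-form penalty: 0.5 * sum(sigma^2 + mu^2 - log sigma^2 - 1)."""
    return 0.5 * np.sum(sigma**2 + mu**2 - np.log(sigma**2) - 1.0)

def vae_loss(x, mu, sigma, decoder, n_samples=1000, seed=0):
    """Monte Carlo estimate of the VAE loss in (13.67) for a single sample x."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((n_samples, mu.size))   # u ~ N(0, I_d)
    z = reparameterize(mu, sigma, u)                # shape (n_samples, d)
    x_hat = np.array([decoder(z_i) for z_i in z])   # decoder output for each draw
    recon = 0.5 * np.mean(np.sum((x - x_hat) ** 2, axis=1))
    return recon + kl_penalty(mu, sigma)

# Hypothetical linear decoder G(z) = W z, used only for illustration.
W = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
decoder = lambda z: W @ z

x = np.array([0.5, -1.0, 0.2])
mu = np.array([0.1, -0.3])       # stands in for mu_phi(x)
sigma = np.array([0.8, 0.5])     # stands in for sigma_phi(x)
loss = vae_loss(x, mu, sigma, decoder)
print(loss)
```

Drawing different values of `u` in `reparameterize` and passing the result through `decoder` gives different outputs for the same `x`, which is the controllability described around (13.68).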


13.6.3 β-VAE


By inspection of the VAE loss in (13.67), we can easily see that the first term represents
the distance between the generated samples and the real ones, whereas the second
term is the KL distance between the real latent space measure and the posterior
distribution. Therefore, the VAE loss is a measure of distance that considers both
the latent space and the ambient space between real and generated samples.

In fact, this observation nicely fits into our geometric view of the autoencoder
illustrated in Fig. 13.2. Here, the ambient image space is $X$, the real data distribution
is $\mu$, whereas the autoencoder output data distribution is $\mu_\theta$. The latent space is
$Z$. In the autoencoder, the generator $G_\theta$ corresponds to the decoder, which is a
mapping from the latent space to the sample space, $G_\theta : Z \mapsto X$, realized by a
deep network. Then, the goal of the decoder training is to make the push-forward
measure $\mu_\theta = G_{\theta\#}\zeta$ as close as possible to the real data distribution $\mu$. Additionally,
an encoder $F_\phi$ maps from the real data in $X$ to the latent space, $F_\phi : X \mapsto Z$, so that
the encoder pushes forward the measure $\mu$ to a distribution $\zeta_\phi = F_{\phi\#}\mu$ in the latent
space. Therefore, the VAE design problem can be formulated as minimizing the
sum of both distances, which are measured by the average sample distance and the KL
distance, respectively.

Rather than giving uniform weights to both distances, the $\beta$-VAE [175] relaxes
this constraint of the VAE. Following the same incentive as in the VAE, we want
to maximize the probability of generating real data, while keeping the distance
between the real and estimated posterior distributions small (say, under a small
constant). This leads to the following $\beta$-VAE cost function:

$$
\begin{aligned}
\ell_{\beta\text{-}VAE}(\theta, \phi) = &\ \frac{1}{2}\int_X\!\int \|x - G_\theta\left(\mu_\phi(x) + \sigma_\phi(x) \odot u\right)\|^2 r(u)\,du\,d\mu(x) \\
&+ \frac{\beta}{2}\sum_{i=1}^{d} \int_X \left(\sigma_i^2(x) + \mu_i^2(x) - \log \sigma_i^2(x) - 1\right) d\mu(x),
\end{aligned} \tag{13.69}
$$

where $\beta$ now controls the importance of the distance measure in the latent space.
When $\beta = 1$, it is the same as the VAE. When $\beta > 1$, it applies a stronger constraint
on the latent space.
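
The effect of $\beta$ can be seen in closed form in a toy case that we introduce here purely for illustration: a scalar sample $x$, a scalar latent variable, and a hypothetical linear decoder $G_\theta(z) = wz$. The per-sample loss then reads $\frac{1}{2}\left((x - w\mu)^2 + w^2\sigma^2\right) + \frac{\beta}{2}\left(\sigma^2 + \mu^2 - \log\sigma^2 - 1\right)$, and setting its derivatives with respect to $\mu$ and $\sigma^2$ to zero gives $\mu = wx/(w^2 + \beta)$ and $\sigma^2 = \beta/(w^2 + \beta)$. The sketch below evaluates these expressions; all names are illustrative.

```python
import numpy as np

def beta_vae_loss_1d(x, mu, sigma2, w, beta):
    """Per-sample beta-VAE loss for scalar x, scalar latent, and decoder G(z) = w * z."""
    recon = 0.5 * ((x - w * mu) ** 2 + w**2 * sigma2)   # 0.5 * E|x - w(mu + sigma u)|^2
    penalty = 0.5 * (sigma2 + mu**2 - np.log(sigma2) - 1.0)
    return recon + beta * penalty

def optimal_encoder_1d(x, w, beta):
    """Minimizer over (mu, sigma^2), obtained by setting the gradient to zero."""
    mu = w * x / (w**2 + beta)
    sigma2 = beta / (w**2 + beta)
    return mu, sigma2

for beta in (1.0, 4.0, 16.0):
    mu, sigma2 = optimal_encoder_1d(x=2.0, w=1.0, beta=beta)
    print(beta, mu, sigma2)
```

In this toy setting, increasing $\beta$ moves the optimal encoder mean toward 0 and its variance toward 1, i.e. toward the standard Gaussian prior, which is one way to read the statement that a larger $\beta$ constrains the latent space more strongly.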

As a higher $\beta$ imposes a stronger constraint on the latent space, it turns out that the
latent space becomes more interpretable and controllable, which is known as
disentanglement. More specifically, if each variable in the inferred latent representation
$z$ is only sensitive to one single generative factor and relatively invariant to other
factors, we say that this representation is disentangled or factorized. One benefit
that often comes with a disentangled representation is good interpretability and easy
generalization to a variety of tasks. For some conditionally independent generative
factors, keeping them disentangled is the most efficient representation, and the
$\beta$-VAE provides a more disentangled representation. For example, the faces generated
by the original VAE have various directions, whereas they point toward specific
directions in the $\beta$-VAE, implying that the factor for the face direction is successfully
disentangled [175].
Fig. 13.8 Concept of normalizing flow


13.6.4 Normalizing Flow, Invertible Flow 


The normalizing flow (NF) [178-181] is a modern way of overcoming the limitations
of the VAE. As shown in Fig. 13.8, normalizing flow transforms a simple distribution
into a complex one by applying a sequence of invertible transformation functions. 
Flowing through a chain of transformations, we repeatedly substitute the variable 
for the new one according to the change-of-variables theorem and eventually obtain 
a probability distribution of the final target variable. Such a sequence of invertible 
transformations is the origin of the name “normalizing flow” [179]. 

The derivation of the cost function for a normalizing flow also starts with the
same ELBO and encoder model in (13.60). However, the normalizing flow chooses
a different encoder function:

$$
z = F_\phi^x(u) = F_\phi(\sigma u + x), \tag{13.70}
$$

where $F_\phi$ is an invertible function. Here, the invertibility is the key component,
so the algorithm is often called the invertible flow. Specifically, if we choose the
decoder as the inverse of the encoder function, i.e. $G_\theta = F_\phi^{-1}$, a very interesting
phenomenon happens. More specifically, the first term in (13.61) can be simplified
as follows:

$$
\begin{aligned}
-\int \log p_\theta\left(x|F_\phi^x(u)\right) r(u)\,du
&= \frac{1}{2}\int \|x - G_\theta\left(F_\phi^x(u)\right)\|^2 r(u)\,du \\
&= \frac{1}{2}\int \|x - G_\theta\left(F_\phi(\sigma u + x)\right)\|^2 r(u)\,du \\
&= \frac{1}{2}\int \|\sigma u\|^2 r(u)\,du = \frac{\sigma^2}{2}\int \|u\|^2 r(u)\,du = \frac{\sigma^2 d}{2},
\end{aligned}
$$

which becomes a constant. Therefore, it is no longer necessary to consider the
decoder part in the parameter estimation. Accordingly, aside from the constant term,
the ELBO loss in (13.61) can be simplified as

$$
\ell_{flow}(x, \phi) = -\int \log p\left(F_\phi^x(u)\right) r(u)\,du
- \int \log\left|\det\left(\frac{\partial F_\phi^x(u)}{\partial u}\right)\right| r(u)\,du, \tag{13.71}
$$

where we have also removed the $\int \log r(u)\,r(u)\,du$ term since this is also a constant.
Under the Gaussian assumption for $p(z)$, (13.71) can be further simplified as

$$
\ell_{flow}(x, \phi) = \frac{1}{2}\int \|F_\phi(\sigma u + x)\|^2 r(u)\,du
- \int \log\left|\det\left(\frac{\partial F_\phi(\sigma u + x)}{\partial u}\right)\right| r(u)\,du. \tag{13.72}
$$


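Now the main technical difficulty of NF arises from the last term, which involves
a complicated determinant calculation for a huge matrix. As discussed before, NF
mainly focuses on the encoder function $F_\phi$ (and, likewise, the decoder $G_\theta$), which is
composed of a sequence of transformations:

$$
F_\phi(u) = (h_K \circ h_{K-1} \circ \cdots \circ h_1)(u). \tag{13.73}
$$

Using the change-of-variable formula,

$$
\frac{\partial F_\phi(u)}{\partial u} = \frac{\partial h_K}{\partial h_{K-1}} \cdots \frac{\partial h_2}{\partial h_1}\frac{\partial h_1}{\partial h_0}, \tag{13.74}
$$

we have

$$
\log\left|\det\left(\frac{\partial F_\phi(u)}{\partial u}\right)\right| = \sum_{i=1}^{K} \log\left|\det\left(\frac{\partial h_i}{\partial h_{i-1}}\right)\right|, \tag{13.75}
$$

where $h_0 = u$. Therefore, most of the current research efforts for NF have focused
on how to design an invertible block such that the determinant calculation is simple.
Now, we review a few representative techniques.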
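
Before turning to specific blocks, the additivity of the log-determinant in (13.75) can be checked numerically. The sketch below assumes, purely for illustration, that each layer is an affine map $h_i(u) = A_i u + b_i$ with a randomly drawn (and, in practice, invertible) matrix $A_i$; real flows use nonlinear blocks, but the bookkeeping of the log-determinant is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 3, 4

# Hypothetical invertible layers h_i(u) = A_i u + b_i (affine, for illustration only).
A = [rng.standard_normal((d, d)) + 3.0 * np.eye(d) for _ in range(K)]
b = [rng.standard_normal(d) for _ in range(K)]

def flow(u):
    """Apply h_K o ... o h_1 and accumulate sum_i log|det(dh_i/dh_{i-1})|."""
    h, logdet = u, 0.0
    for A_i, b_i in zip(A, b):
        h = A_i @ h + b_i
        logdet += np.linalg.slogdet(A_i)[1]   # log|det| of this layer's Jacobian
    return h, logdet

# The Jacobian of the whole composition is the matrix product A_K ... A_1.
J_total = np.linalg.multi_dot(A[::-1])
z, logdet_sum = flow(rng.standard_normal(d))
logdet_direct = np.linalg.slogdet(J_total)[1]
print(logdet_sum, logdet_direct)
```

The two printed numbers agree up to rounding, which is why the design effort concentrates on making each per-layer determinant cheap.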

NICE (nonlinear independent component estimation) [178] is based on learning
a non-linear bijective transformation between the data space and a latent space. The
architecture is composed of a series of blocks defined as follows, where $x_1$ and $x_2$
are a partition of the input in each layer, and $y_1$ and $y_2$ are a partition of the output.
Then, the NICE update is given by

$$
\begin{aligned}
y_1 &= x_1, \\
y_2 &= x_2 + F(x_1),
\end{aligned} \tag{13.76}
$$

where $F(\cdot)$ is a neural network. Then, the block inversion can be readily done by

$$
\begin{aligned}
x_1 &= y_1, \\
x_2 &= y_2 - F(y_1).
\end{aligned} \tag{13.77}
$$

Furthermore, it is easy to see that its Jacobian has a unit determinant and the cost
function in (13.72) and its gradient can be tractably computed.
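
The NICE block in (13.76) and its inverse in (13.77) can be sketched as follows. The coupling function `F` below is an arbitrary stand-in for the neural network, and the finite-difference Jacobian is included only to check the unit determinant numerically; both are assumptions made for this illustration.

```python
import numpy as np

def F(x1):
    """Stand-in for the coupling network F; any function of x1 works here."""
    return np.tanh(2.0 * x1) + x1**2

def nice_forward(x1, x2):
    return x1, x2 + F(x1)          # (13.76)

def nice_inverse(y1, y2):
    return y1, y2 - F(y1)          # (13.77)

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of f: R^n -> R^n at x."""
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

def block(x):
    """NICE block acting on the concatenated vector x = [x1, x2]."""
    y1, y2 = nice_forward(x[:2], x[2:])
    return np.concatenate([y1, y2])

x = np.array([0.3, -1.2, 0.7, 2.0])
J = numerical_jacobian(block, x)
det_J = np.linalg.det(J)
print(det_J)
```

The Jacobian is lower triangular with ones on the diagonal, so its determinant is 1 regardless of how complicated `F` is; this is also why the block is volume preserving.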

However, this architecture imposes some constraints on the functions the network
can represent; for instance, it can only represent volume-preserving mappings.
Follow-up work [180] addressed this limitation by introducing a new reversible
transformation. More specifically, the space of such models is extended by
real-valued non-volume-preserving (real NVP) transformations, which use the following
operation [180]:

$$
\begin{aligned}
y_1 &= x_1, \\
y_2 &= x_2 \odot \exp(s(x_1)) + t(x_1),
\end{aligned} \tag{13.78}
$$

where $s$ denotes the point-wise scaling network, $t$ is referred to as the translation
network, and $\odot$ is the element-wise multiplication. Then, the corresponding Jacobian
matrix is given by

$$
\frac{\partial y}{\partial x} =
\begin{bmatrix}
I_d & 0 \\
\frac{\partial y_2}{\partial x_1} & \mathrm{diag}\left(\exp(s(x_1))\right)
\end{bmatrix}. \tag{13.79}
$$

Given the observation that this Jacobian is triangular, we can efficiently compute
its determinant as

$$
\det\left(\frac{\partial y}{\partial x}\right) = \exp\left(\sum_j s(x_1)[j]\right), \tag{13.80}
$$

where $s(x_1)[j]$ denotes the $j$-th element of $s(x_1)$. The inverse of the transform can also
be easily implemented by

$$
\begin{aligned}
x_1 &= y_1, \\
x_2 &= (y_2 - t(y_1)) \odot \exp(-s(y_1)).
\end{aligned} \tag{13.81}
$$

The corresponding block architecture is illustrated in Fig. 13.9.
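
A minimal sketch of the real NVP coupling layer follows, with the forward pass (13.78), the log-determinant obtained from (13.80), and the inverse (13.81). The functions `s` and `t` below are hypothetical closed-form stand-ins for the scale and translation networks, chosen only so that the example is self-contained.

```python
import numpy as np

def s(x1):
    """Stand-in scale network (hypothetical)."""
    return 0.5 * np.sin(x1)

def t(x1):
    """Stand-in translation network (hypothetical)."""
    return x1**2 - 1.0

def realnvp_forward(x1, x2):
    y2 = x2 * np.exp(s(x1)) + t(x1)      # (13.78)
    log_det = np.sum(s(x1))              # logarithm of (13.80)
    return x1, y2, log_det

def realnvp_inverse(y1, y2):
    return y1, (y2 - t(y1)) * np.exp(-s(y1))   # (13.81)

x1 = np.array([0.4, -0.9])
x2 = np.array([1.5, 0.2])
y1, y2, log_det = realnvp_forward(x1, x2)
x1_rec, x2_rec = realnvp_inverse(y1, y2)
print(log_det, x1_rec, x2_rec)
```

Note that neither direction requires inverting `s` or `t`, so these networks can be arbitrarily complex, and the log-determinant costs only a sum over the outputs of `s`.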
Fig. 13.9 Forward and inverse architecture of a building block in real NVP transform [180]. (a)
Forward propagation. (b) Inverse propagation


Fig. 13.10 Example of normalizing flow using GLOW [181]. Figure courtesy of
https://openai.com/blog/glow/


Due to the successive application of transforms, one of the important advantages
of NF is the gradual change of the distribution. Figure 13.10 shows examples using
GLOW, the generative flow using 1 × 1 invertible convolutions [181]. As the name
indicates, GLOW has additional 1 × 1 invertible convolution blocks to increase the
expressiveness of the network.


13.7 Unsupervised Learning via Image Translation 


So far, we have discussed generative models that generate samples from noise.
Generative models are also useful for converting one distribution to another. This is
why generative models have become the main workhorse for unsupervised learning tasks.
Among the various unsupervised learning tasks, in this section we mainly
focus on image translation, which is a very active area of research.


13.7.1 Pix2pix 


Pix2pix [194] was presented in 2016 by researchers from Berkeley in their work
“Image-to-Image Translation with Conditional Adversarial Networks.” This is not
unsupervised learning per se, as it requires matched data sets, but it opened a new era
of image translation, so we review it here.

Most of the problems in image processing and computer vision can be posed 
as “translating” an input image into a corresponding output image. For example, a 
scene may be rendered as an RGB image, a gradient field, an edge map, a semantic 
label map, etc. In analogy to automatic language translation, we define automatic 
image-to-image translation as the task of translating one possible representation of 
a scene into another, given a large amount of training data. 

Pix2pix uses a generative adversarial network (GAN) [88] to learn a function
that maps an input image to an output image. The network is made up of two
main pieces: the generator and the discriminator. The generator transforms the input
image to get the output image. The discriminator measures the similarity of the
generated image to the target image from the data set, and tries to guess whether it was
produced by the generator.

For example, in Fig. 13.11, the generator produces a photo-realistic shoe image
from a sketch, and the discriminator tries to tell whether a given image is the
real photo corresponding to the sketch or a fake one.

The nice thing about pix2pix is that it is generic and does not require the user to
define any relationship between the two types of images. It makes no assumptions
about the relationship and instead infers the objective during training by comparing
the given inputs and outputs. This
makes pix2pix highly adaptable to a wide variety of situations, including ones where
it is not easy to verbally or explicitly define the task we want to model.




Fig. 13.11 Discriminator concept in pix2pix 
That said, one downside of pix2pix is that it requires paired data sets to learn
their relationship, and these are often difficult to obtain in practice. This issue is
largely addressed by cycleGAN [185], which is the topic of the following section.


13.7.2 CycleGAN 


Image-to-image translation is an important task in computer vision and graphics 
problems. Examples include: 


• Translating summer landscapes to winter landscapes (or the reverse).
• Translating paintings to photographs (or the reverse).
• Translating horses to zebras (or the reverse).


As discussed before, pix2pix [194] is designed for such tasks, but it requires
paired examples, specifically, a large data set of many examples of input images
in the domain $X$ (e.g. sketches of shoes) and the same images with the desired
modification that can be used as the expected output images in $Y$ (e.g. photos of
shoes) (see the left column of Fig. 13.12). The requirement for a paired training data
set is a limitation. Such data sets are challenging, or even impossible, to collect,
e.g. photos of zebras and horses with exactly the same poses, size, etc.

Rather, the unpaired situation in Fig. 13.12 is more realistic, where a collection
of images in $X$ (for example, photos) and an unpaired collection of images
in $Y$ (for example, Monet’s paintings) are available. Then, the goal of the image
translation is to convert the distribution in $X$ to that in $Y$, and vice versa. In fact, the
cycleGAN by Zhu et al. [185] demonstrated that such unpaired image translation is
indeed possible.

The cycleGAN problem nicely fits into our geometric view of the autoencoder in 
Fig. 13.2, which is redrawn in Fig. 13.13 using a domain Y. Accordingly, optimal 
transport (OT) [182, 184] provides a rigorous mathematical tool to understand the 
geometry of unsupervised learning by cycleGAN. 


Fig. 13.12 Paired vs. unpaired image translation (left: paired; right: unpaired)


Fig. 13.13 Geometric view of CycleGAN-based unsupervised learning


Here, the target image space $X$ is equipped with a probability measure $\mu$, whereas
the original image space $Y$ has a probability measure $\nu$. Since there are no paired
data, the goal of unsupervised learning is to match the probability distributions
rather than each individual sample. This can be done by finding transportation
maps that transport the measure $\mu$ to $\nu$, and vice versa. More specifically, the
transportation from a measure space $(Y, \nu)$ to another measure space $(X, \mu)$ is done
by a generator $G_\theta : Y \mapsto X$, realized by a deep network parameterized with $\theta$.
Then, the generator $G_\theta$ “pushes forward” the measure $\nu$ in $Y$ to a measure $\mu_\theta$
in the target space $X$ [182, 184]. Similarly, the transport from $(X, \mu)$ to $(Y, \nu)$
is performed by another neural network generator $F_\phi$, so that the generator $F_\phi$
pushes forward the measure $\mu$ in $X$ to $\nu_\phi$ in the original space $Y$. Then, the
optimal transport map for unsupervised learning can be achieved by minimizing
the statistical distances $\mathrm{dist}(\mu_\theta, \mu)$ between $\mu$ and $\mu_\theta$, and $\mathrm{dist}(\nu_\phi, \nu)$ between $\nu$
and $\nu_\phi$, and our proposal is to use the Wasserstein-1 metric as a means to measure
the statistical distance.

More specifically, for the choice of a metric $d(x, x') = \|x - x'\|$ in $X$, the
Wasserstein-1 metric between $\mu$ and $\mu_\theta$ can be computed by [182, 184]

$$
W_1(\mu, \mu_\theta) = \inf_{\pi \in \Pi(\mu, \nu)} \int_{X \times Y} \|x - G_\theta(y)\|\,d\pi(x, y). \tag{13.82}
$$

Similarly, the Wasserstein-1 distance between $\nu$ and $\nu_\phi$ is given by

$$
W_1(\nu, \nu_\phi) = \inf_{\pi \in \Pi(\mu, \nu)} \int_{X \times Y} \|F_\phi(x) - y\|\,d\pi(x, y). \tag{13.83}
$$
Rather than minimizing (13.82) and (13.83) separately with distinct joint distributions,
a better way of finding the transportation map is to minimize them together
with the same joint distribution $\pi$:

$$
\inf_{\pi \in \Pi(\mu, \nu)} \int_{X \times Y} \left(\|x - G_\theta(y)\| + \|F_\phi(x) - y\|\right) d\pi(x, y). \tag{13.84}
$$

One of the most important contributions of [195] is to show that the primal
formulation of the unsupervised learning in (13.84) can be represented by a dual
formulation:

$$
\min_{\phi, \theta} \max_{\psi, \varphi} \ell_{cycleGAN}(\theta, \phi; \psi, \varphi), \tag{13.85}
$$

where

$$
\ell_{cycleGAN}(\theta, \phi; \psi, \varphi) = \lambda\,\ell_{cycle}(\theta, \phi) + \ell_{Disc}(\theta, \phi; \psi, \varphi), \tag{13.86}
$$

where $\lambda > 0$ is the hyper-parameter, and the cycle-consistency term is given by

$$
\ell_{cycle}(\theta, \phi) = \int_X \|x - G_\theta(F_\phi(x))\|\,d\mu(x) + \int_Y \|y - F_\phi(G_\theta(y))\|\,d\nu(y),
$$

whereas the second term is

$$
\begin{aligned}
\ell_{Disc}(\theta, \phi; \psi, \varphi) = &\max_{\varphi} \int_X \varphi(x)\,d\mu(x) - \int_Y \varphi(G_\theta(y))\,d\nu(y) \\
&+ \max_{\psi} \int_Y \psi(y)\,d\nu(y) - \int_X \psi(F_\phi(x))\,d\mu(x).
\end{aligned} \tag{13.87}
$$

Here, $\varphi$ and $\psi$ are often called Kantorovich potentials and satisfy the 1-Lipschitz
condition, i.e.

$$
\begin{aligned}
|\varphi(x) - \varphi(x')| &\leq \|x - x'\|, \quad \forall x, x' \in X, \\
|\psi(y) - \psi(y')| &\leq \|y - y'\|, \quad \forall y, y' \in Y.
\end{aligned}
$$

Fig. 13.14 CycleGAN network architecture


Fig. 13.15 Unsupervised style transfer in paintings


In the machine learning context, the 1-Lipschitz potentials $\varphi$ and $\psi$ correspond to
the Wasserstein-GAN (W-GAN) discriminators [177]. Specifically, $\varphi$ corresponds
to a discriminator that distinguishes real from generated images in
$X$, whereas $\psi$ is a discriminator that tells the fake and real samples apart in the domain
$Y$. Moreover, the cycle-consistency term $\ell_{cycle}$ works to impose the one-to-one
correspondence between the original and target domains, removing the
mode-collapsing behavior of GANs. The corresponding network architecture is
shown in Fig. 13.14. Specifically, $\varphi$ tries to find the difference between the
true image $x$ and the generated image $G_\theta(y)$, whereas $\psi$ attempts to find the
fake measurement data that are generated by the synthetic measurement procedure
$F_\phi(x)$. In fact, this formulation is equivalent to the cycleGAN formulation [185]
except for the use of 1-Lipschitz discriminators.
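
The two terms of the loss in (13.86) can be estimated from samples as in the following sketch. Several simplifications are assumed here: the integrals are replaced by sample means, the potentials are fixed hand-picked 1-Lipschitz functions rather than being maximized over, the generators are exact linear inverses of each other in a toy 2-D setting, and the default weight `lam=10.0` is a hypothetical choice rather than a value prescribed by the text.

```python
import numpy as np

def cycle_loss(xs, ys, G, F):
    """Empirical cycle-consistency term: mean|x - G(F(x))| + mean|y - F(G(y))|."""
    term_x = np.mean([np.linalg.norm(x - G(F(x))) for x in xs])
    term_y = np.mean([np.linalg.norm(y - F(G(y))) for y in ys])
    return term_x + term_y

def disc_loss(xs, ys, G, F, phi, psi):
    """Empirical discriminator term for fixed potentials phi (on X) and psi (on Y)."""
    term_x = np.mean([phi(x) for x in xs]) - np.mean([phi(G(y)) for y in ys])
    term_y = np.mean([psi(y) for y in ys]) - np.mean([psi(F(x)) for x in xs])
    return term_x + term_y

def cyclegan_loss(xs, ys, G, F, phi, psi, lam=10.0):
    """lam * cycle term + discriminator term, mirroring (13.86)."""
    return lam * cycle_loss(xs, ys, G, F) + disc_loss(xs, ys, G, F, phi, psi)

# Toy 2-D example: Y is X rotated by 90 degrees; G and F are exact inverses.
R = np.array([[0.0, -1.0], [1.0, 0.0]])
F = lambda x: R @ x            # X -> Y
G = lambda y: R.T @ y          # Y -> X
phi = lambda x: x[0]           # a 1-Lipschitz potential on X (hypothetical choice)
psi = lambda y: y[1]           # a 1-Lipschitz potential on Y (hypothetical choice)

rng = np.random.default_rng(0)
xs = rng.standard_normal((100, 2))
ys = np.array([F(x) for x in xs])
print(cyclegan_loss(xs, ys, G, F, phi, psi))
```

When the generators are mutually inverse and map one sample set exactly onto the other, both terms vanish; replacing `G` by a map that is not the inverse of `F` makes the cycle term strictly positive.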

CycleGAN has been very successful for various unsupervised learning tasks. 
Figure 13.15 shows the examples of unsupervised style transfers between two 
different styles of paintings. 


13.7.3 StarGAN 


As the example in Fig. 13.15 suggests, one downside of cycleGAN is that we need to
train separate generators for each pair of domains. For example, if there are $N$
different styles in the paintings, there should be $N(N - 1)$ distinct generators to
translate the images (see Fig. 13.16a).
Fig. 13.16 Multi-domain translation: (a) cycleGAN (cross-domain models), and (b) starGAN


To overcome the limited scalability of cycleGAN, starGAN was
proposed [87]. Specifically, as shown in Fig. 13.16b, one generator is trained such
that it can translate into multiple domains by adding a mask vector that represents
a target domain. This mask vector is a one-hot encoding of the target domain and is
appended to the input along the channel direction.
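
The channel-wise augmentation with a one-hot mask vector can be sketched as below. The channel-first layout `(C, H, W)`, the function name, and the toy sizes are assumptions made for this illustration and are not taken from the starGAN implementation.

```python
import numpy as np

def append_domain_mask(image, target_domain, num_domains):
    """Concatenate a one-hot domain mask to an image along the channel axis.

    image: array of shape (C, H, W); returns shape (C + num_domains, H, W).
    """
    c, h, w = image.shape
    one_hot = np.zeros(num_domains, dtype=image.dtype)
    one_hot[target_domain] = 1
    # Repeat each entry of the one-hot vector over the spatial dimensions.
    mask = np.broadcast_to(one_hot[:, None, None], (num_domains, h, w))
    return np.concatenate([image, mask], axis=0)

image = np.random.default_rng(0).random((3, 4, 4))   # toy RGB image
generator_input = append_domain_mask(image, target_domain=2, num_domains=5)
print(generator_input.shape)   # (8, 4, 4)
```

The same generator then receives the image together with the mask, so changing `target_domain` changes the requested translation without changing the network.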

Given training data from two different domains, these models learn to translate
images from one domain to the other, for example, changing the hair color
(attribute) of a person from black (attribute value) to blond (attribute value). We
denote a domain as a set of images sharing the same attribute value. People with
black hair compose one domain and people with blond hair compose another
domain. Here, the discriminator has two things to do. It should be able to identify
whether an image is fake or not. With the help of an auxiliary classifier, the
discriminator can also predict the domain of the image given as input to the
discriminator (see Fig. 13.17).

With the auxiliary classifier, the discriminator learns the mapping between an original
image and its corresponding domain from the data set. When the generator generates
a new image conditioned on a target domain $c$ (say, blond hair), the discriminator
predicts the domain of the generated image, so $G$ is trained until the
discriminator classifies its output as the target domain $c$ (blond hair). Figure 13.18 shows
such an example of multi-domain translation using a single starGAN generator.


13.7.4 Collaborative GAN 


In many applications requiring multiple inputs to obtain the desired output, if any of 
the input data is missing, it often introduces large amounts of bias. Although many 
techniques have been developed for imputing missing data, image imputation is still 
difficult due to the complicated nature of natural images. To address this problem, a
novel framework called collaborative GAN (CollaGAN) [186] was proposed.
Fig. 13.18 Examples of multi-domain translation using a single StarGAN generator (columns: input, black hair, blond hair, brown hair, gender, aged)


Specifically, CollaGAN converts an image imputation problem to a multi-domain
image-to-image translation task so that a single generator and discriminator
network can successfully estimate the missing data using the remaining clean data
set. More specifically, CycleGAN and StarGAN are interested in transferring one
image to another, as shown in Fig. 13.19a,b, without considering the remaining
domain data set. However, in image imputation problems, the missing data occurs
infrequently, and the goal is to estimate the missing data by utilizing the other clean
data set. Therefore, an image imputation problem can be correctly described as in
Fig. 13.19c, where one generator can estimate the missing data using the remaining
clean data set. Since it is difficult to know a priori which domain will be missing, the
imputation algorithm should be designed such that one algorithm can estimate the
missing data in any domain by exploiting the data for the rest of the domains.

Fig. 13.19 Comparison of various multi-domain translation architectures. (a) Cross-domain
models. (b) StarGAN. (c) Collaborative GAN

Fig. 13.20 CollaGAN generator and discriminator architecture

Fig. 13.21 Missing image imputation results from CollaGAN

Due to the specific applications, CollaGAN is not an unsupervised learning
method. However, one of the key concepts in CollaGAN is the cycle consistency
for multiple inputs, which is useful for other applications. Specifically, since the
inputs are multiple images, the cycle loss should be redefined. In particular, for
$N$-domain data, from a generated output, we should be able to generate $N - 1$ new
combinations as the other inputs for the backward flow of the generator (Fig. 13.20,
middle). For example, when $N = 4$, there are three combinations of multi-input and
single-output so that we can reconstruct the three images of the original domains using
the backward flow of the generator. As for the discriminator, it
should have a classifier header as well as the discriminator part, similar to that of
StarGAN.
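
The counting of the backward-flow combinations can be illustrated with a short enumeration. The sketch assumes one particular reading of the paragraph above: the generated image replaces the missing domain, and each remaining domain is then re-estimated from the other $N - 1$ images. The domain names and the `_hat` suffix are purely illustrative.

```python
def multi_cycle_inputs(domains, target):
    """List the backward-flow input sets for the multiple cycle consistency.

    The generated target image stands in for the missing one; each remaining
    domain is then reconstructed from the other N-1 images.
    """
    fake = target + "_hat"                       # generated image for the target domain
    others = [d for d in domains if d != target]
    combos = []
    for held_out in others:
        inputs = [fake] + [d for d in others if d != held_out]
        combos.append((inputs, held_out))        # (generator inputs, domain to reconstruct)
    return combos

for inputs, held_out in multi_cycle_inputs(["a", "b", "c", "d"], target="a"):
    print(inputs, "->", held_out)
```

For $N = 4$ the loop prints three input sets, one per original domain to be reconstructed, matching the count given in the text.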

Figure 13.21 shows an example of missing domain imputation, where CollaGAN 
produces very realistic images. 


13.8 Summary and Outlook 


So far, we have discussed an exciting field of deep learning: generative models. This
is by no means an exhaustive review, as there are many other exciting algorithms.
Here, the main emphasis is on providing a unified mathematical view for understanding
the various algorithms. As emphasized in the chapter, this field is important not only
because of its fancy applications, but also because it is grounded in firm mathematical
foundations. As Yann LeCun said, unsupervised learning is the core of deep learning,
so there will be many exciting new applications and opportunities for developing
new theory, and young researchers are invited to participate in this exciting field.


13.9 Exercises 
1. Show the following equality:

$$
D_{JS}(P\|Q) = \frac{1}{2}D_{KL}(P\|M) + \frac{1}{2}D_{KL}(Q\|M), \tag{13.88}
$$

where $M = (P + Q)/2$.

2. Show that for JS divergence, absolute continuity is not necessary.
3. For the following generator function $f(u)$, derive (1) the $f$-divergence form,
and (2) the $f$-GAN formulation from the definition of $f$-divergence using
convex dual.

(a) $f(u) = (u + 1)\log\frac{2}{u + 1} + u\log u$.
(b) $f(u) = u\log u$.
(c) $f(u) = (u - 1)^2$.


4. Let $\mu$ and $\nu$ denote 1-D probability measures with the cumulative distribution
functions $F$ and $G$, respectively. Show that the Wasserstein-$p$ distance between
$\mu$ and $\nu$ is given by (13.21).
5. Prove Eq. (13.22).
6. Prove Eq. (13.26).
7. Derive the optimal transport map $T$ in (13.27) between two Gaussian distributions.
8. Show that the AdaIN can be interpreted as the optimal transport between two
i.i.d. Gaussian distributions.
9. Let the transport cost $c(x, y) : X \times Y \to \mathbb{R} \cup \{\infty\}$ be given by $c(x, y) = h(x - y)$
with $h$ strictly convex.

(a) Show that there exists a Kantorovich potential $\varphi$ such that the optimal
transport plan $T$ that transports the measure $\mu$ in $X$ to $\nu$ in $Y$ can be
represented as

$$
T(x) = x - (\nabla h)^{-1}\left(\nabla\varphi(x)\right), \tag{13.89}
$$

where $(\nabla h)^{-1}$ denotes the inverse function of $\nabla h$.
(b) As a special case, if $h(x - y) = \|x - y\|^2$, show that the optimal transport
map can be represented by

$$
y = T(x) = \nabla u(x),
$$

where $u(x) := \|x\|^2/2 - \varphi(x)$ is convex for some function $\varphi(x)$.


10. Prove (13.64).
11. Prove (13.66).
12. For the given reparametrization trick in the VAE

$$
z = F_\phi^x(u) = \mu_\phi(x) + \sigma_\phi(x) \odot u, \quad u \sim \mathcal{N}(0, I_d), \tag{13.90}
$$

where $x \in \mathbb{R}^n$, $z, u \in \mathbb{R}^d$, $\mu_\phi(\cdot), \sigma_\phi(\cdot) : \mathbb{R}^n \to \mathbb{R}^d$, and $\odot$ is the element-wise
multiplication, show the following equality:

$$
-\int \log\left|\det\left(\frac{\partial F_\phi^x(u)}{\partial u}\right)\right| r(u)\,du = -\frac{1}{2}\sum_{i=1}^{d} \log \sigma_i^2(x),
$$

where $r(u)$ is the probability density function.
13. What are the advantages and disadvantages of the $\beta$-VAE over the VAE?
14. Consider the NICE update for the normalizing flow given by

$$
y_1 = x_1, \quad y_2 = x_2 + F(y_1). \tag{13.91}
$$

(a) Why does the Jacobian term become the identity? Please derive explicitly.
(b) Suppose we are interested in a more expressive network given by

$$
y_1 = x_1 + G(x_2), \quad y_2 = x_2 + F(y_1) \tag{13.92}
$$

for some function $G$. What is the inverse operation? How can you make the
corresponding normalizing flow cost function simple in terms of Jacobian
calculation? You may want to split the update into two steps to simplify the
derivation.


Chapter 14
Summary and Outlook


With the tremendous success of deep learning in recent years, the field of data 
science has undergone unprecedented changes that can be considered a “revolution”. 
Despite the great successes of deep learning in various areas, there is a tremendous 
lack of rigorous mathematical foundations which enable us to understand why deep 
learning methods perform well. In fact, the recent development of deep learning
is largely empirical, and the theory that explains the success lags seriously
behind. For this reason, until recently, deep learning was viewed as pseudoscience
by rigorous scientists, including mathematicians. 

In fact, the success of deep learning appears very mysterious. Although sophisticated
network architectures have been proposed by many researchers in recent
years, the basic building blocks of deep neural networks are the convolution, pooling 
and nonlinearity, which from a mathematical point of view are regarded as very 
primitive tools from the “Stone Age”. However, one of the most mysterious aspects 
of deep learning is that the cascaded connection of these “Stone Age” tools results 
in superior performance that far exceeds the sophisticated mathematical tools. 
Nowadays, in order to develop high-performance data processing algorithms, we
do not have to hire highly educated doctoral students or postdocs, but only need to give
TensorFlow and plenty of training data to undergraduate students. Does this mean a dark
age of mathematics? Then, what is the role of mathematicians in this data-driven
world?

A popular explanation for the success of deep neural networks is that the neural 
network was developed by mimicking the human brain and is therefore destined 
for success. In fact, as discussed in Chap. 5, one of the most famous numerical
experiments is the emergence of the hierarchical features from a deep neural 
network when it is trained to classify human faces. Interestingly, this phenomenon 
is similarly observed in human brains, where hierarchical features of the objects 
emerge during visual information processing. Based on these numerical observa- 
tions, some of the artificial neural network “hardliners” even claim that instead 
of mathematics we need to investigate the biology of the brain to design more 
sophisticated artificial neural networks and to understand the working principle of 
artificial neural networks. However, when neuroscientists (especially computational 
neuroscientists) were asked why the brain extracts such hierarchical features, it was 
surprising to find out that they usually rely on numerical simulations with artificial 
neural networks to explain how hierarchical properties arise in the brain. From a 
mathematical point of view, this is a typical example of “circular proof”, an apparent 
logical fallacy. 

Then, how can we fill the gap between empirical success and the lack of 
theory? In fact, one of the lessons we learn from the history of science is 
that a gap between empirical observation and theory is not 
a limiting factor, but rather signals the birth of a new science. For example, during 
the “golden age of physics” in the early twentieth century, some of the most 
exciting empirical discoveries in physics were quantum phenomena. Experimental 
physicists discovered many exotic quantum phenomena that could not be explained 
by either Newtonian or relativistic physics. In fact, there was a serious lag in 
the theoretical physics that could explain newly discovered quantum phenomena. 
Mathematical models were further developed, questioned, and refuted by the 
empirical observations. Even the great Albert Einstein said that he could not 
believe quantum physics since “God does not play dice with the universe.” During 
these intense intellectual efforts to explain the seemingly unexplainable empirical 
observations, the new theory of quantum mechanics was rigorously formed, which 
produced numerous Nobel laureates; and new mathematics such as functional analysis, 
harmonic analysis, etc., became mainstream in modern mathematics. In 
fact, these efforts by scientists completely changed the landscape of physics and 
mathematics. 

Similarly, now there is a great need to develop mathematical theories to explain 
the enormous empirical success of deep neural networks. In fact, computer scientists 
and engineers who work on the implementation are like the experimental physicists 
who give endless inspiration, and the mathematicians and signal processors are like 
theoretical physicists who try to find the unified mathematical theory to explain 
the empirical discoveries. Therefore, contrary to the false belief that we are in the 
dark age of mathematics, we are now actually living in the “golden age”, ready to 
discover the beautiful mathematical theory of deep learning that can completely 
change the field of mathematics. To this end, this book has aimed to explore the 
mathematical theory of deep learning to crack open the black box of deep learning 
and open a new age of mathematics. 

The field of deep learning is interdisciplinary in nature, and includes mathemat- 
ics, data science, physics, biology, medicine, etc. Therefore, collaborative research 
efforts between mathematics and other fields are crucial. This is because empirical 
results not only inspire mathematical theory, but also provide a 
means to verify whether a mathematical theory is correct. Thus, although this 
book primarily focuses on discovering the fundamental mathematical principles 
of deep learning, it is hoped that it will play an instrumental role in promoting the 
basic sciences in physics, biology, chemistry, geophysics, etc. using deep learning, 
and enable readers to be inspired by new empirical problems to obtain better 
mathematical models. 